Using ColdFusion Structures To Remove Duplicate List Values

By Ben Nadel

Published 2006-12-13 in ColdFusion — Comments (47)

The other week, I posted a solution for Ray Camden's Friday Puzzler for Santa's list. It involved using the ColdFusion structure (struct) as a utility to clean lists. I thought it worked quite nicely and thought that I would talk about it here quickly. The beauty of the ColdFusion struct is that it is an object that maps keys to values. This mapping feature forces the struct to only have unique keys as there would be no way to map two duplicate keys to two different values. This mapping can be leveraged to clean the list of duplicate values.

To demonstrate this, let's build a list that has duplicate values:

<!--- Create a group values that have duplicate values. --->
<cfsavecontent variable="strListData">
	Ghost Busters
	What About Bob
	Broken Flowers
	What About Bob
	Lost In Translation
	Stripes
	Scrooge
	What About Bob
	Lost In Translation
</cfsavecontent>


<!--- Create a list based on the list data. --->
<cfset lstMovies = strListData.Trim().ReplaceAll(
	"[\r\n\t]+",
	","
	) />

This is a bit of a round-about way to create a list, but I am doing this for readability (so you can really see the list). If we output the list above:

<!--- Break list on to different lines. --->
#lstMovies.ReplaceAll( ",", "<br />" )#

... we get:

Ghost Busters
What About Bob
Broken Flowers
What About Bob
Lost In Translation
Stripes
Scrooge
What About Bob
Lost In Translation

Now, we can loop over the list and add them to the ColdFusion struct. Remember, due to the mapping nature of the struct, duplicate keys will override each other, thereby removing duplicates:

<!--- Create a structure to hold the movie titles. --->
<cfset objMovies = StructNew() />

<!--- Loop over the list to add mappings to the struct. --->
<cfloop index="strMovie" list="#lstMovies#" delimiters=",">

	<!---
		Store key/value pair. In this case, we don't
		really care about the value (hence storing ""
		as value). We only care about the key.
	--->
	<cfset objMovies[ strMovie ] = "" />

</cfloop>

<!--- Convert the now unique key values into a list. --->
<cfset lstMovies = StructKeyList( objMovies ) />

Now, outputting that list:

<!--- Break list on to different lines. --->
#lstMovies.ReplaceAll( ",", "<br />" )#

... we get:

Lost In Translation
What About Bob
Stripes
Scrooge
Ghost Busters
Broken Flowers

Notice that this time, the duplicates have been removed. Notice also that the order of the list has been changed. That usually won't matter as I don't think you should require proper ordering in a list that is required to remove duplicate values.

I am not sure how this compares (performance-wise) to the loop/check strategy that constantly checks for duplicates. I almost want to say that this would be faster as there really is no logic involved. There's no CFIF statements, no conditionals; all we do is set values and then join them together. It's that easy.

One final note: this is not case sensitive. The structure, like most all ColdFusion, treats mixed cases as the same value. If you want to do this type of algorithm with case sensitivity, you can create a Java 2 Hashtable and use it in similar way. I won't go into that though, as that is not something any of us really need to do (or otherwise, we probably shouldn't be using lists perhaps?).

Want to use code from this post? Check out the license.

Short link: https://bennadel.com/432

Reader Comments

Dan Wilson Dec 13, 2006 at 9:09 AM

36 Comments

Very Clean!

There are other ways to remove duplicates in a list, but none as elegant as this.

Good Show!

Dan Wilson

Ben Nadel Dec 13, 2006 at 9:24 AM

15,996 Comments

Dan,

Glad you like. Always happy to share.

Dan Roberts Dec 13, 2006 at 10:05 AM

27 Comments

Thanks! I've been looking for a way to speed up the creation of a unique 5 character id list I create for emails tracking in one of our products. The best way I could find to keep the list unique was just using string contains on a list but your struct alternative is MUCH faster. I'm still checking my code but here are some figures:

2,000 IDs
- List/String = 1797ms
- Struct = 171ms

10,000 IDs
- List/String = 41750ms
- Struct = 937ms

Ben Nadel Dec 13, 2006 at 10:10 AM

15,996 Comments

Daniel,

That's awesome. I had a gut feeling tell me it would be faster, but it's nice to see some numbers on it.

Dan Roberts Dec 13, 2006 at 10:22 AM

27 Comments

100,000 IDs
- List/String (not going to do it now, tried in the past and its like 20 minutes)
- Struct = 50266ms (WOW!)

I have it flushing counts every 100 ID attempts and it pauses around 36800 and 74900 every time so there is some type of conversion going on around those sizes. I noticed similar pauses occurring when using strings.

The algorithm is close to linear instead of growing in time based on the size of the list. Don't remember all that Big O notation stuff to explain that further.

dc Dec 13, 2006 at 2:34 PM

1 Comments

I'm no regexp ninja but would you not be better just using regular expressions?

nothing at hand at the moment to show for code, but thought i'd chuck that in the ring

Ben Nadel Dec 13, 2006 at 2:38 PM

15,996 Comments

DC,

Regular expressions are totally awesome, but they require a lot of parsing and processing. And, the larger the string, the longer it takes in what seems a non-linear fashion... that's not from official testing, that's just what I have noticed.

That, and I wouldn't even know how to start writing that expression.

Peter Bell Dec 13, 2006 at 5:13 PM

111 Comments

Hey Ben,

VERY cool! Very cool indeed!

Ben Nadel Dec 13, 2006 at 5:35 PM

15,996 Comments

Pete,

Glad you like. One thing to note as well is that you can do StructKeyArray() as well if you need to clean an array rather than a list. This will just return the struct-keys into a one-dimensional array.

Tim Jul 27, 2007 at 12:25 AM

10 Comments

Now how about an easy way to tell *which* elements of the list are duplicated?

Dan Roberts Jul 27, 2007 at 12:33 PM

27 Comments

The regular expression idea interests me. I'm curious about the efficiency of checking for duplicates across a list in a single regular expression search. Not as efficient as structs for deduping but if you have to find dupes it may be faster than looping for every item in list.

First you would need to sort the list, then run a regular expression such as the following "(,|$)([^,]+),\2(,|^)".

Just remembered though that REFind only returns the first match so that would require some looping to find the next match. hmmm

Ben Nadel Jul 27, 2007 at 2:04 PM

15,996 Comments

@Dan,

Regular expression are totally awesome... but you want to avoid having to parse a string too many times. I would guess that any kind of looping in conjunction with regex is gonna be slower than using arrays and structs.

Dan Roberts Jul 27, 2007 at 2:43 PM

27 Comments

I'm not actually advocating a loop through each piece of data, rather an REFind on an entire list string using the delimiter in regular expression to detect dupes. The benefit with a regular expression in this instance over an array loop is that you are only looping as many times as there are matches and you are able to start where you left off.

For example with the regular expression, if there are no duplicates you would do a single REFind which would find no duplicates and you would be done. If it found one you would start another REFind from where you found the first dupe to look for the next.

A loop over each item in the list or array on the other hand would require an action on ever item in the list to compare to every other item in the list. That quickly looses efficiency because processing increases exponentially, which was the need for the struct work around of course.

Anyways, something to think about and I might play with when I have time.

Ben Nadel Jul 28, 2007 at 5:25 PM

15,996 Comments

@Dan,

I this is interesting. I would certainly love to see what you come up with if you have time play around with the idea. I am always looking to learn new techniques.

Matt Feb 27, 2008 at 8:30 PM

2 Comments

Though the RE method may help us find duplicates be it tell us they exist or tell us which are the duplicates the original purpose of this post was to REMOVE duplicates not to find them.

For it's purpose this works well enough in comparison to the alternatives out there which serve ONLY this purpose and no extra.

As a thought should you find your data in a Query object (say you pulled a list from you database and you only want the unique entries for a particular column where the db does not have this constraint imposed) a QofQ may work and scale far better than this method. Where the method here is linear (O(n)) for those that know the notation it is possible the QoQ method may be < O(n) for larger datasets.

Jas Kaur Apr 23, 2009 at 6:35 AM

1 Comments

Absolutley love this, have used the technique a few times!

rich May 13, 2009 at 2:30 PM

3 Comments

How about a way to count the occurrences of each value in the list, before or during the conversion to a struct key list?

Jose Galdamez May 14, 2009 at 3:18 PM

50 Comments

This is a great approach! Thanks for laying out the logic so easily and cleanly. I remember relying on a UDF from cflib.org that also removed duplicates. This one would loop through each list item and add it to a second list only if it hadn't been seen before. As the second list gets larger then searching gets slower. The method shown here would definitely be faster as others have already pointed out. Shoot, Adobe might as well take the code here and make a built-in function out of it. I can dream, can't I?

rich May 15, 2009 at 11:23 AM

3 Comments

I wrote this to count the occurrences of each item when looping a populated list with redundancies into a struct. Worked pretty well. Thanks for the original idea!! :-)

Ben Nadel May 19, 2009 at 9:30 AM

15,996 Comments

@Rich,

Looks good - always glad to provide any inspiration. As a quick note, you don't need to use the # sign so much. It is only needed when a variable is being evaluated *inside* a string.

James Brown Aug 25, 2009 at 11:49 PM

12 Comments

Ben,

Great post! Used it on a couple different projects in the past week! Thanks!!

Jody Fitzpatrick Nov 17, 2009 at 4:09 AM

19 Comments

This is something that is making me pull my hair out, and still is.

I would like to order the results based on the number of occurrences, if a result or particular ID shows up more than once it would have a higher position in the array or list. Then the ones that have less would follow behind.

This is really driving me bonkers.

Jose Galdamez Nov 17, 2009 at 10:47 AM

50 Comments

@Jody

To the best of my knowledge, structures don't really have a notion of ordering, so you would have to convert the structure to some other data type such as 2D array or a query.

If you wanted to go the query route, you first convert your structure to a query and then run a query of queries to run your sort.

<cfscript>
struct = { one = 1, two = 2, three = 3, four = 4, five = 5 };
query = queryNew( "key,num" );
for ( key in struct ){
queryAddRow( query );
querySetCell( query, "key", key );
querySetCell( query, "num", struct[key] );
}
</cfscript>

<cfquery name="qSort" dbtype="query">
SELECT *
FROM query
ORDER BY num desc
</cfquery>

I got my inspiration for this workaround from the following tutorials.

http://www.coldfusioncookbook.com/entry/23/How-do-you-loop-over-the-values-in-a-structure?

http://www.coldfusioncookbook.com/entry/87/How-do-I-sort-a-2-dimensional-array?

If you wanted to go the 2D array route, you could use the code from the latter of the two links above.

Matt Nov 17, 2009 at 11:04 AM

2 Comments

I don't think that's what Jody was exactly looking for but at least it's close.

If there's an array with duplicates I believe Jody wants them ordered from most to least? Obviously the easiest way is likely:

1. Create the 'unique key' structure defined above with value = 0
2. Loop the array incrementing a counter each time a value is seen
3. Iterate the new structure creating again ANOTHER structure where the keys are the numerical counters stored in struct1. When you have 2 keys with the same counter make the value a list or an array of the appropriate text keys.

You can of course use a QofQ method:

<cfscript>
array = ListToArray( "one,two,three,one,one,three,two,four,two,two,four,four,four" );
counter = ArrayNew(1);
ArraySet( counter, 1, ArrayLen( array ), 1 );
query = QueryNew( '' );
QueryAddColumn( query, 'key', array );
QueryAddColumn( query, 'counter', counter );
</cfscript>

<cfquery name="sorted" dbtype="query">
SELECT key, sum(counter) AS total FROM query
GROUP BY key ORDER BY total DESC
</cfquery>

You now have a query of with the keys and the number of occurrences for each key. Do with this as you like.

Jose Galdamez Nov 17, 2009 at 11:39 AM

50 Comments

@Matt

My solution was based off an earlier comment where Rich showed some code on how to create the unique key struct and maintain a count of how many times each duplicate was seen. Either way, looks like there's a couple of ways to get the same result.

Jody Fitzpatrick Nov 18, 2009 at 3:48 AM

19 Comments

Thanks Guys!

After hundreds of hairs being pulled out I finally figured it out.

I just have a few more things to do and my project will be 95.5 percent done.

Once again thanks for all the help here was my solution...

<cfset convertResults_toList = ArrayToList(results_scrub_url) >


<cfset finalResults_List = "">




<cfloop from="1" to="#arraylen(results_scrub_url)#" index="curRes">


<cfloop from="1" to="#arraylen(results_scrub_url)#" index="checkRes">


<cfif results_scrub_url[curRes] eq results_scrub_url[checkRes] AND NOT curRes eq checkRes >

<cfif NOT listContains(finalResults_List, #curRes#) >

<cfset finalResults_List = ListAppend(finalResults_List, "#curRes#") >
</cfif>

</cfif>

</cfloop>

</cfloop>

Jody Fitzpatrick Nov 18, 2009 at 4:09 AM

19 Comments

Well Never Mind!

-- Ugh, I thought I finished it but it turns out its still kicking duplicates. It's going to be a simple fix. I will let you guys know.

Jose Galdamez Nov 21, 2009 at 12:28 PM

50 Comments

@Jody

I can't seem to get your code sample to work. If you are still having problems, try this code out and see if it gets you what you wanted.

<cfset listWithDups = "second,second,second,third,third,first,first,first,first,fourth" />

<cfset query = queryNew("val") />
<cfloop list="#listWithDups#" index="i">
<cfset queryAddRow(query) />
<cfset querySetCell(query, "val", i) />
</cfloop>

<cfquery name="sortedQuery" dbtype="query">
SELECT val, count(*) as numOcurrences
FROM query
GROUP BY val
ORDER BY numOcurrences desc, val asc
</cfquery>

<cfdump var="#sortedQuery#" />

<cfset finalList = valueList(sortedQuery.val) />

<cfdump var="#finalList#" />

Jody Fitzpatrick Nov 21, 2009 at 5:15 PM

19 Comments

@Jose Galdamez,

Oh heh yeah I didn't paste the whole code. I should have defined the vars -- my bad. It's fixed thou. Thanks.

Paul Gillard Mar 1, 2010 at 8:43 AM

1 Comments

How about http://www.iknowkungfoo.com/blog/index.cfm/2008/10/22/Remove-duplicate-list-or-array-elements-using-ColdFusion-and-Java which recommends using Java's LinkedHashSet for this purpose

Dmitry May 21, 2010 at 10:09 AM

8 Comments

Hello Ben.

No good :(
Gave me a "500 server internal error" on CF8 when playing this trick with list consisting of 1 element [37077015].
In all other cases it worked like a charm.
Think I'll have to try Java method instead :(

Ben Nadel May 21, 2010 at 9:58 PM

15,996 Comments

@Dmitry,

Hmm, I am not sure why this would error out with only one list item. I can't see anything that would cause this require two or more list items. What was the exception that is actually raised?

@Paul,

Using the hash set in Java seems like a really cool approach. I believe that under the hood, ColdFusion's struct is actually extending a Map class (FastHashMap if I remember). Looks like the Set class tree is the more appropriate approach. Thanks for the link!

Dmitry May 24, 2010 at 4:15 AM

8 Comments

I believe the reason for the exception was not the list length but it's items. 37077015 is a big numeric value. Making a structure key from it might cause some problems.
Exception was 500 internal server error and then many encoded characters like "‹ ÅZ{sâ8 ÿ;SµßAÇ" :(

jyoseph Dec 29, 2010 at 2:09 PM

20 Comments

Here I am again, 4 years after you created this blog entry and it just helped me today! Thanks for all of your work for the Coldfusion community.

Ben Nadel Dec 29, 2010 at 5:21 PM

15,996 Comments

@Jyoseph,

Ha ha, awesome that old posts can keep on delivering value :) Rock on!

Evik James Jun 2, 2011 at 11:40 AM

10 Comments

Ben, I used your method of removing duplicates again. I did it using CFSCRIPT this time. I figured I'd share it.

SecurityStruct = structNew();
for (i = 1; i lte listLen(SecurityList); i++) {
SecurityStruct[listGetAt(SecurityList, i)] = "";
}
SecurityList = StructKeyList(SecurityStruct);

Ben Nadel Jun 3, 2011 at 10:57 PM

15,996 Comments

@Evik,

Ha ha, awesome :) Are you doing stuff mostly in script these days?

Marty Franklin Sep 2, 2011 at 3:16 PM

1 Comments

Ben

This is just awesome! Great thinking outside of the box. Its powerful stuff like this that kicks the poo out of .Net

Thanks again!

Anna Sep 2, 2011 at 3:26 PM

383 Comments

Thanks for sharing. Nice, clean solution.

Kate Sep 30, 2011 at 1:49 PM

23 Comments

You are amazing. Thank you for all that you do!

George Fallar Nov 18, 2011 at 6:35 PM

1 Comments

Okay you just saved me a lot of time with an elegant solution. AGAIN!

Thanks!

dcs Mar 22, 2013 at 9:31 AM

25 Comments

Late to the party but figured I would add my two cents anyway.

How about a way to count the occurrences of each value in the list, before or during the conversion to a struct key list?

I would imagine that one would use listValueCount() (or listValueCountNoCase()), e.g.:

<cfset objMovies[ strMovie ] = listValueCount(lstMovies, strMovie) />

Not having actually tested it I'm not sure what it would add to the processing time.

Nina Oct 1, 2013 at 1:49 PM

2 Comments

Hi Ben. I ran across this post trying to find a solution to my problem of cleaning duplicates from a 2D array. I can't convert the array to a list because it is not 1D. Do you have any suggestions on how to do this for 2D arrays?
Thanks.

Jose Galdamez Oct 1, 2013 at 2:44 PM

50 Comments

@Nina,

That's certainly an interesting one to solve. Does your 2D array have a consistent width for each row? What happens to the duplicate values? Do you set the cells to null?

Nina Oct 1, 2013 at 5:11 PM

2 Comments

@Jose,

No, the widths are not consistent for each row. Right now I am not doing anything with the duplicate values, they're just sitting there being duplicates (lol). I don't know how to get rid of them in the array is the reason.
The array is getting duplicate values because I have 2 different queries that are populating the array, each of which may be returning the same value, thus the duplicate.
I think I am just going to have to figure out a way to combine the queries, or maybe a way to make it a 1D array, get rid of dups, then go get the other data I need using the array value.

Jose Galdamez Oct 1, 2013 at 6:03 PM

50 Comments

@Nina,

I'm wondering if something like this would work. What do you think?

If this is too inefficient maybe there's a way of doing a UNION on the queries and a GROUP BY on top of that UNION.

<cfscript>
var duplicateStruct = {};
var key = "";
 
// Loop through outer array
for (rowNum = 1; rowNum <= arrayLen(twoDArray); rowNum++) {
	// Loop through inner array
	for (colNum = 1; colNum <= arrayLen(twoDArray[rowNum]); colNum++) {
		key = twoDArray[rowNum][colNum];
 
		if (!structKeyExists(duplicateStruct, key)) {
			// First time seeing this key
			duplicateStruct[key] = 1;
		} else {
			// Keep count of how many times duplicate appears
			duplicateStruct[key]++;
 
			// Set array cell to null/undefined
			twoDArray[rowNum][colNum] = javaCast("null", 0);
		}
	}
}
</cfscript>

Sam Daams Jan 8, 2014 at 4:08 AM

10 Comments

For those who end up here using Google and scroll all the way down, if you are using CF10, you can now use listRemoveDuplicates() to get a list without duplicates.

Oh my chickens, this post is old!

Hit me up on Twitter if you want to discuss it further.