Today, I was working on some data parsing when it hit me like a bolt of lightening! Why not just use Regular Expressions to parse CSV (comma separated values) data files? For those of you who have seen my previous attempts at parsing CSV data files using ColdFusion, you will see that having to handle embedded field qualifiers and data delimiters made it a character by character parsing problem since you can no longer treat it like a list of lists (both comma a new line delimited). It was a huge amount of code to do a huge amount of work.
But then, as I was working today, suddenly I realized that working with embedded field qualifiers was exactly like working with embedded quotes in HTML tag attribute value parsing; it's just a pattern. So, what is the pattern of the CSV file. At it's highest level, it's a field value followed by an optional delimiter. I say "optional" delimiter because the last data field in the file will not be succeeded by a delimiter. The field value is then an optional qualified value containing zero or more characters. The allowable characters in the field are determined be the qualification of the data.
The regular expression I came up with was so short it made me giggle:
("(?:[^"]|"")*"|[^",\r\n]*)(,|\r\n?|\n)?
This ColdFusion compatible regular expression captures two groups; the first group is the field value (zero or more characters) and the second group is the optional succeeding delimiter. If this is totally not readable (as most regular expressions are not), I have converted this to a verbose regular expression (?x) with my typical amount of commenting so that you can really see how the pattern of the data field can be matched:
Launch code in new window » Download code as text file »
So now that you see how simple the pattern is, let's use it to start parsing our CSV data in ColdFusion. Luckily, from within ColdFusion, we have access to the Java Pattern and the Java Pattern Matcher which make our lives almost too easy when it comes to iterating over a string. As we iterate over the string, we are going to store our CSV values in an array of arrays in which each sub array will represent a row from the CSV data file.
Launch code in new window » Download code as text file »
There's a lot of commenting going on there (typical Me!), but if you filter all that out, this is only a few lines of code. That's how easy regular expressions make our lives! Compare that with any of my previous ColdFusion CSV parsing solutions to see that this one is a small fraction of the size. And, best part is, it works! Running the above code, we get the following CFDump output:
| | | | ||
| | ![]() | | ||
| | | |
It works on a small file, sure piece of cake, but what about performance? Sometimes we (ok, maybe just me) get sucked so hard into how sexy Regular Expressions are, that we lose sight of the big picture and we end up going down a path that is less performant in the long run. To test this, I am going to create a beefy CSV file with this code:
Launch code in new window » Download code as text file »
This creates a 10,000 line CSV file with over 2,150,000 characters. This file comes out to be just over 2 megabytes in size. Now, to make sure I am not messing with the time trials, I am running this script separately than the parsing code.
Now for the test! I ran the current version that employs regular expressions to do the parsing against my previous version which checks characters.
Average parse time: 100,000 ms (100 seconds)
Average parse time: 11,000 ms (11 seconds)
Holy Cow! The regular expression solution is about 10 times faster than the previous solution! And, this was no rinky-dinky example; this was a seriously large CSV file (at least in my world).
Just for fun, I regenerated the above test CSV file, except this time, I created a 50,000 record CSV file. This file has 10,750,000 characters in it and comes out at well over 10 megabytes. I didn't even want to bother running this in the old solution. However, I am very pleased to say that all 50,000 records were successfully parsed in just 57,000 ms (57 seconds)! That's pretty sweet-ass-sweet.
Let's stop for a second and just look at the performance scaling. The first run went 10,000 records in 11,000 ms. That comes out to be 909 records per second. The second run went 50,000 records in 57,000 ms. That comes out to be 877 records per second. As the data set grows, this thing is just gonna EAT UP MEMORY (creating this massive array in RAM), but as you can see, the efficiency of the algorithm is fairly constant. This is some awesome news.
| | | | ||
| | ![]() | | ||
| | | |
Download Code Snippet ZIP File
Comments (18) | Post Comment | Ask Ben | Permalink | Other Searches | Print Page
ColdFusion 8 ImageDrawTextArea() (Inspired By Barney Boisvert!)
Java String Buffer Treated As String In ColdFusion (When Needed)
Holy Pseudo-Random Strings Of Characters, Batman! :)
Firstoff, very cool.
Secondly, I have a question that will wither mean I'm starting to understand Regular Expressions or... well, not.
It looks to me like you've got two groupings that create unused backreferences, namely ("(?:[^"]|"")*"|[^",\r\n]*) and (,|\r\n?|\n) If I understand this correctly, wouldn't (?:"(?:[^"]|"")*"|[^",\r\n]*)(?:,|\r\n?|\n) give better performance because you're not creating unused backreferences?
Or am I missing something?
Posted by Matt Osbun on Sep 28, 2007 at 8:42 AM
@Matt,
True, the groups do create back references that can use within the current regular expression. However, it makes the matched pattern much more convenient when grabbing the matched text when you capture the values.
If none of the groups were capturing, I could still grab the whole matching using:
objMatcher.Group()
However, it would not be as easy to figure out where the field value ends and the !optional! delimiter begins. By allowing the field and the delimiter to be captured via a group, then it makes it much more convenient to use:
objMatcher.Group( 1 )
... to get the field value, and
objMatcher.Group( 2 )
... to get the optional delimiter.
So yes, it might be faster in the pattern to not capture, but overall, capturing the groups make the algorithm easier to code and to understand.
Posted by Ben Nadel on Sep 28, 2007 at 9:04 AM
:)
Posted by tony petruzzi on Sep 28, 2007 at 9:06 AM
@Ben
Gotcha. Thanks.
Posted by Matt Osbun on Sep 28, 2007 at 9:19 AM
@Tony,
I love those comics. I need to get that RegEx t-shirt: http://xkcd.com/store/
Posted by Ben Nadel on Sep 28, 2007 at 10:41 AM
Here, we use the java Ostermiller CSV parser
http://ostermiller.org/utils/CSV.html
On my test server, your code running at around 4000 - 5000ms
Ostermiller's is under 300ms
We do A LOT of data importing and exporting, so I've spent tons and tons of time looking for the best way to parse files.
The only CF I have to deal with is to call a function:
<cfscript>
function parseCSV(csvPath) {
jFIS = createObject( "java", "java.io.FileInputStream" );
jFIS.init("#ARGUMENTS.csvPath#");
csvParser = createObject( "java", "com.Ostermiller.util.CSVParser" );
csvParser.init( jFIS );
csvAsArray = csvParser.getAllValues();
csvParser.close();
jFIS.close();
return csvAsArray;
}
</cfscript>
<cfset csvArray = parseCSV(strCSVPath)>
Posted by Jason Rushton on Sep 28, 2007 at 10:57 AM
@Jason,
Yeah, a real Java package is probably going to outperform anything that I write. But keep in mind that mine runs basic ColdFusion without any additions. I assume that for the CSVParser to work, I would have to add a Java package to the ColdFusion install. This might not be a problem for most people, but I do not have access to do that sort of stuff.
But regardless, thanks for testing it up against the "big boys". Good to see where I fall in line :)
Posted by Ben Nadel on Sep 28, 2007 at 11:15 AM
Email me your size and address and it will be ordered tonight :)
Posted by tony petruzzi on Sep 28, 2007 at 8:23 PM
Hello Chaps,
I do a fair amount, well, in fact, I do a large amount of CSV data parsing through a bunch of web services and things like that, and this is a very interesting topic for me.
Untill now I've pretty much cut CF out of the loop entirely as I dont really need to do any real 'manipulation' on the data, I just want to get it parsed into my database, so in the past I've always used a DTS package for SQL Server to do this for me, and more recently now I'm running 2k5 just opted for a stored proc that recieves a string for the file path of the CSV, it then does a BULK LOAD on the file, which seems to work very nicely.
Am I still handling this in the best fasion? or would one of these JAVA classes mixed up with an insert type query be my best bet?
I'm still betting that SQL handles this stuff infinatly faster than CF.
Rob
Posted by Rob Rawlins on Sep 29, 2007 at 1:58 PM
Ben, since at http://www.bennadel.com/index.cfm?dax=blog:410.view you asked me to see if I had any suggestions for your CSV parser here, in Java I would use something like the following:
\G(,|\r?\n|\r|^)(?:"([^"]*+(?>""[^"]*+)*)"|([^",\r\n]*+))
While the above is no longer ColdFusion-compatible, I assume that's not a big deal since you're using Java's regex engine anyway. The somewhat more advanced features it uses include possessive quantifiers, an atomic group, Jeffrey Friedl's "unrolling the loop" pattern, and Java's "\G" (just as an extra assurance that the regex's bump-along mechanism always puts us where we want to be). I've also used two different sets of parentheses to avoid having to remove the outer quotes from quoted values in post-processing. Instead, we can just check if a value was captured to backreference 2, and if so, use it, otherwise use backreference 3.
One other thing I would recommend after a brief lookover is changing the replacement of (""){2} with $1 to simply replacing "" with ". Also, if you're going to use a regex to do that, compile the regex outside the loop.
Posted by Steven Levithan on Sep 29, 2007 at 3:28 PM
@Steve,
You magnificent bastard - your optimizations rock. It saves 40% execution time! Unbelievable!
http://www.bennadel.com/index.cfm?dax=blog:978.view
Thanks for taking your time to help me out.
Posted by Ben Nadel on Sep 29, 2007 at 6:06 PM
Dude - You're a star.. Lovin' this code. I added The following code to convert the array to a query:-
<!--- Convert to a query --->
<cfset columnList = arrayToList(arrData[1])>
<cfset qryNew = queryNew(columnList)>
<cfloop from="2" to="#arrayLen(arrData)#" index="ptr">
<cfset itemArray = arrData[ptr]>
<cfset queryAddRow(qryNew)>
<cfloop from="1" to="#arrayLen(itemArray)#" index="ptr2">
<cfset querySetCell(qryNew, "#listGetAt(columnList, ptr2)#", itemArray[ptr2], qryNew.recordcount)>
</cfloop>
</cfloop>
Posted by Martin parry on Oct 15, 2007 at 6:51 AM
@Martin,
Glad you like it. Just watch out with your array to query algorithm as the array is not guaranteed to have all the sub-arrays the same length. Just be careful of the query dimensions.
Posted by Ben Nadel on Oct 15, 2007 at 7:12 AM
Ben,
Thank you for forwarding me to this article. Works even faster then the last and with less code!!!
one quick question, why can't i assign vars to each row in the cfloop output? Usually the following works...
Can you throw in an example how i can set a var to each row so that i can insert into the db? Thanks in advance
This isn't working:
<cfloop from="1" to="#arrayLen(arrData)#" index="i">
<cfoutput>#arrData[i]#</cfoutput> <br />
</cfloop>
Posted by Rob G on Sep 29, 2008 at 8:49 PM
@Rob,
Not to keep bumping you around, but I actually took this algorithm and wrapped it up in a user defined function (UDF):
http://www.bennadel.com/index.cfm?dax=blog:991.view
As far as your question, I am not really sure what you mean. Can you explain further? Maybe using the UDF above will help you out a bit.
Posted by Ben Nadel on Sep 29, 2008 at 8:53 PM
Ben,
I downloaded the UDF and the array dump of my CSV is working great. I am just having a problem setting variables to each column in each row in the csv.
For my particular CSV file, I have 19 columns and each column is shown in the dump but what var name do i use to output that data in a loop?
see below. this is the copy/paste of the dump. I'm sure it's easy but how do what syntax do i use to set a var to each row in the array? The normal methods i use are not working right... thanks.
array
1 501
2 9938
3 2002
4 BMW
5 128i
6 Sedan
7 1B7HL48X62S557215
8 0
9 24449
10 22000
11 Red
12 Tan, blue
13 Automatic
14 [empty string]
15 Description 1
16 Car
17 V6
18 Front Wheel Drive
19 Gasoline
Posted by Rob G. on Sep 29, 2008 at 9:12 PM
@Rob,
The result of the UDF is an array of arrays. Therefore, you need to refer to both indexes:
#arrData[ intRowIndex ][ intFieldIndex ]#
I think the problem is that you are only referring to the row index. That gives you another array which requires a second index.
Posted by Ben Nadel on Sep 29, 2008 at 9:30 PM
Yep! That's it. Thanks, i'm used to using structures within arrays and that seemed tricky for me at first
<cfoutput>#arrData[1][1]#</cfoutput>
etc...
Great! I can slee well tonight now :)
You da man.
Posted by Rob G. on Sep 29, 2008 at 9:35 PM