The other day, I put together a regular-expression-based CSV parsing algorithm in ColdFusion. I like to think that I have a pretty good handle on writing regular expressions, but I know that I am not a wizard like Steve Levithan. As such, I asked him to take a look at what I had written to see if it could be optimized. I started off with this:
("(?:[^"]|"")*"|[^",\r\n]*)(,|\r\n?|\n)?
Steve Levithan took a look at that and came back with this:
\G(,|\r?\n|\r|^)(?:"([^"]*+(?>""[^"]*+)*)"|([^",\r\n]*+))
Before I get into how this works (as best as I can even understand), let's talk numbers. When parsing a 10 megabyte file with 10,750,000 characters over 50,000 lines of data, my version of the regular expression ran in about 58 seconds. This is not bad considering that my previous algorithm was horrible.
I then took Steve's regular expression and implemented it and ran it on the same file. This ran, on average, in about 35 seconds. Holy mackerel! That's almost a 40% reduction in execution time thanks to his optimization.
The optimization that Steve put into the expression was not just tied to the regular expression itself, but also to the use of that regular expression in the rest of the code. Where as I grabbed the entire field value, regardless of qualification, Steve broke qualified and non-qualified field values into two different capturing groups. This is actually a huge improvement because it:
Also, while this it not necessarily an optimization, since his pattern does not have an optional delimiter, we can remove the use of the ColdFusion CFBreak tag that you may have seen in my previous implementation. Not having the need to use CFBreak is just more elegant, in my opinion.
Ok, so let's see how this 40% reduction in execution time is achieved. Here it the algorithm and example:
Launch code in new window » Download code as text file »
And, just so believe me that this actually works, here is the CFDump output that we get:
| | | | ||
| | ![]() | | ||
| | | |
So, like I said, the whole idea of possessive qualifiers and atomic grouping is still very fuzzy to me. However, I did try to break Steve's regular expression apart and explain it. Please take this with a grain of salt as it might not be 100% accurate (or helpful):
Launch code in new window » Download code as text file »
Thanks Steve! You rock.
Download Code Snippet ZIP File
Comments (9) | Post Comment | Ask Ben | Permalink | Other Searches | Print Page
Resident Evil: Extinction Starring Milla Jovovich
ColdFusion 8 ImageDrawTextArea() (Inspired By Barney Boisvert!)
Heh... Thanks for the regex luv.
As you noted, although there is significant optimization in the regex itself (through possessive quantifiers/atomic groups and Jeffrey Friedl's "unrolling the loop" pattern, which avoids the inefficiency of something like "(?:[^"]|"")*"), much of the speedup comes from the changes which allow simplified post-processing.
By the way, in your original post you mentioned in the code comments that you weren't sure if it was necessary to allow a single line-feed character (\n) for line breaks. That's a Unix\Linux style line break. Mac uses \r, and Windows uses \r\n.
Posted by Steve on Oct 1, 2007 at 4:58 PM
@Steve,
Thanks for all the tips. I have to look further into this Unrolling The Loop technique as the whole non-backtracking stuff is still a bit fuzzy, especially when it comes to seeing where it should be applied. Do you know of any good explanations of this? Or should I just Google it?
Posted by Ben Nadel on Oct 1, 2007 at 5:07 PM
regular-expressions.info has pretty good information on atomic groups and possessive quantifiers at http://www.regular-expressions.info/possessive.html and http://www.regular-expressions.info/atomic.html
The general pattern for unrolling the loop is:
opening normal* (special normal*)* closing
...although there are a few additional rules, such as that the start of special and normal must never intersect, and special must not match an empty string.
You can see it in the above pattern as:
"([^"]*+(?>""[^"]*+)*)"
Where the leading and trailing double-quote are opening and closing, [^"] is normal, and "" is special.
There is a little more information at http://www.softec.st/en/OpenSource/DevelopersCorner/RegularExpressions/UnrollingTheLoop.html
Posted by Steve on Oct 1, 2007 at 6:47 PM
Check out this article on regex optimization:
http://www.javaworld.com/javaworld/jw-09-2007/jw-09-optimizingregex.html
It is, in part, Java specific, and makes a few debatable claims, but on the whole it is a very good article for the basics of regex optimization (a topic which isn't written about often enough, or at least not by people who know what they're talking about). One day I hope to write up an in-depth article on JavaScript regex optimization.
Posted by Steve on Oct 1, 2007 at 8:02 PM
@Steve,
Yeah, I love http://www.regular-expressions.info. I think it has a ton of great information. Thanks for the link on unrolling the loop. I think I get it (or at least getting there). I'll have to check out that article on Javascript regular expression optimization.
Posted by Ben Nadel on Oct 2, 2007 at 7:32 AM
How would you go about this if the delimeter were a TAB character. I tried adding "|\t" in the line where you have "\G(,|\r?\n|\r|^)", but this didn't seem to work.
Also, can you incorporate this into a function like you did in your original blog post, so the data, delimiter, file, and qualifier can be passed into CSVToArray(). Thanks.
Posted by Troy on Oct 11, 2007 at 1:41 PM
@Troy,
I will certainly wrap this up in a UDF. The \t probably didn't work because you need to reference later on in the code as well as in the Regular Expression.
I'll try to post tomorrow.
Posted by Ben Nadel on Oct 11, 2007 at 2:03 PM
Thanks Ben, looking forward to the new and improved UDF. The speed is incredibly faster and is going to make my client very happy. And if you can make it support the TAB delimiter, all the better.
In the old UDF, a 184KB CSV file with 1000 rows took 35.59 sec to parse.
in the new UDF, the same file is parsed in only 1.57 sec.
Posted by Troy on Oct 11, 2007 at 3:20 PM
@Troy,
The UDF has been posted:
http://www.bennadel.com/index.cfm?dax=blog:991.view
That's awesome to hear that the new algorithm is so much faster. Thanks for the feedback.
Posted by Ben Nadel on Oct 12, 2007 at 9:02 AM