Parsing CSV Values Using A Standard File Format
I have been working on parsing CSV values in ColdFusion and it has given me such a headache. The hardest part is trying to figure out if you are dealing with a field qualifier (ex. "ben") or dealing with an escaped qualifier (ex. "Ben is the ""bomb"""). I came up with a hacky solution that iterates over each character in the line. This is not performant and also does not handle all the qualifier situations correctly.
Tony Petruzzi suggested that I try using a Tokenizer. I had originally tried this in my first attempt but ran into similar problems involving field qualifiers. Working on it again this morning, I realized that my biggest problem was that I didn't even know what the CSV standard file format was! How can I solve a problem when I don't even know what my problem domain is?!?
After doing some quick Googling, I found this page, http://www.edoceo.com/utilis/csv-file-format.php, which listed the CSV file format standard as:
- Each record is one line - Line separator may be LF (0x0A) or CRLF (0x0D0A); a line separator may also be embedded in the data (making a record span more than one line but still be acceptable).
- Fields are separated with commas. - Duh.
- Leading and trailing whitespace is ignored - Unless the field is delimited with double-quotes, in which case the whitespace is preserved.
- Embedded commas - Field must be delimited with double-quotes.
- Embedded double-quotes - Embedded double-quote characters must be doubled, and the field must be delimited with double-quotes.
- Embedded line-breaks - Fields must be surrounded by double-quotes.
- Always Delimiting - Fields may always be delimited with double-quotes; the delimiters will be parsed and discarded by the reading application.
This makes things SOOO much easier. Knowing that an embedded delimiter or qualifier MUST be in a field that is fully qualified simplifies my life so much. Now, if I come across a token like this:
""""
I don't have to think about whether it's two escaped qualifiers or one escaped qualifier in a fully qualified field. Based on the standard, I know that embedded qualifiers MUST be in a qualified field and hence, this is a single escaped qualifier in a qualified field (NOT two escaped qualifiers).
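To make that concrete, here is a rough sketch in Java (the platform ColdFusion sits on) of how a fully qualified token decodes under these rules. The helper name `unescapeQualifiedField` is just something I made up for illustration:

```java
public class CsvFieldDemo {

    // Hypothetical helper: given a token that is a fully qualified
    // (double-quote delimited) field, strip the outer quotes and
    // collapse each doubled quote ("") into a single quote (").
    static String unescapeQualifiedField(String token) {
        String inner = token.substring(1, token.length() - 1); // drop outer quotes
        return inner.replace("\"\"", "\"");                    // "" -> "
    }

    public static void main(String[] args) {
        // The four-quote token ("""") is a qualified field containing
        // one escaped qualifier, so it decodes to a single double-quote.
        System.out.println(unescapeQualifiedField("\"\"\"\""));
        // "

        System.out.println(unescapeQualifiedField("\"Ben is the \"\"bomb\"\"\""));
        // Ben is the "bomb"
    }
}
```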
Based on this new information, I should be able to have the String Token version of the ColdFusion CSV parser up and running soon. But let this be a lesson - if you are trying to solve a problem, be sure you truly understand what the problem is :)
Thinking about it now though, why bother using the Tokenizer? That involves function calls. Why not just convert the row to a list using a split method. Looping over an Array has got to be faster than using a Tokenizer.
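One thing worth checking before dropping the Tokenizer, though: as far as I can tell, Java's `StringTokenizer` silently skips empty tokens, while `String.split()` keeps interior empty fields, which matters for CSV because an empty field still occupies a column. A quick sketch:

```java
import java.util.Arrays;
import java.util.StringTokenizer;

public class SplitVsTokenizer {
    public static void main(String[] args) {
        String row = "a,,c";

        // split() keeps the empty field between the two commas...
        String[] parts = row.split(",");
        System.out.println(Arrays.toString(parts)); // [a, , c]

        // ...while StringTokenizer silently skips empty tokens,
        // which loses column positions in a CSV row.
        StringTokenizer st = new StringTokenizer(row, ",");
        while (st.hasMoreTokens()) {
            System.out.println(st.nextToken()); // prints only "a" and "c"
        }
    }
}
```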
Reader Comments
I've used the following two free tools when I needed to read CSV values. They should be helpful for anyone who wants code that does it for you or for anyone who is writing their own code and wants to see an example.
1. CSV Library (open source, CFC)
http://labs.redbd.net/projects/csvLibrary.cfm
2. cfx_text2Query (free, CFX)
http://www.emerle.net/programming/display.cfm/t/cfx_text2query
Gabriel,
That looks pretty cool. Giving the algorithm a once-over, it appears to go through one character at a time. I would like to accomplish this without having to examine each character, but that may just be the best way to go.
Thanks for posting these links. I am sure they are going to be very useful. I can't use the CFX one (no install permissions), but it is good to know it is there.
I've started doing everything on the database server side using DTS packages, which makes life a whole lot easier. Although if you are using something other than SQL Server, I guess you have to go that route.
Bob, good point... but if for no other reason, it is fun to figure this stuff out.
Dealing with dirty data can always throw a kink into things. If it's a one off data import, sometimes it's just easier to open the file first in Excel, expand out the columns and look for the dirty row.
For continual usage, the list functions are great, and when you have a known column count, you can use ListLen() for a partial validity check and then jump to some cleanup functions if the length doesn't match the expected result. Just make sure that you don't have any empty columns.
Try <cfset myList = ",,foo,bar">
<cfoutput>#ListLen(myList)#</cfoutput>
and you'll see what I mean.
Chris,
Yeah, dirty data is something I fear, but I suppose, as with anything else, if you don't use it properly, it's OK if it breaks (it's not wrong to break). As far as interface-type stuff, I am hoping to add this to my POI Utility.
Blank list items can be a pain, especially in ColdFusion. Luckily, it seems that when you have a string and use the underlying Java method String::Split( regEx ), the regular expression "[,]{1}" will split ",,," into four empty values. This works quite nicely!
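One caveat worth double-checking here: in standard Java, `split()` with the default limit argument discards *trailing* empty strings, so whether you actually get four empty values from ",,," may depend on how it is called. Passing a negative limit preserves them all:

```java
import java.util.Arrays;

public class EmptyFieldsDemo {
    public static void main(String[] args) {
        String row = ",,,";

        // With the default limit, Java's split() discards trailing
        // empty strings, so ",,," yields a zero-length array.
        System.out.println(row.split(",").length);     // 0

        // A negative limit keeps every empty field -- four of them here.
        System.out.println(row.split(",", -1).length); // 4
        System.out.println(Arrays.toString(row.split(",", -1)));
        // [, , , ]
    }
}
```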
I have a home-grown CFC that I use for parsing CSV values (though I turn them into a 2D array rather than a query), and I recently retooled it to use the string.Split(chr(10)) and string.split(",") methods, i.e.
(pseudo-code)
<cfset var i = 0>
<cfset var j = 0>
<cfset destArray = inText.split(chr(10))>
<cfloop from="1" to="#arrayLen( destArray )#" index="i">
<cfset destArray[i] = destArray[i].split(",")>
<cfloop from="1" to="#arrayLen( destArray[i] )#" index="j">
<cfif left(destArray[i][j],1) eq """">
<cfif j lt arrayLen( destArray[i] )>
<cfset destArray[i][j] = listAppend( destArray[i][j], destArray[i][j+1] )>
<cfelse>
<cfset arrayAppendArray( destArray[i], destArray[i+1] )>
</cfif>
</cfif>
</cfloop>
</cfloop>
An interesting thing I noticed, though, is that String.split(), at least in CFMX 6.1, doesn't return a real ColdFusion array, so all of the ArrayAppend-type functions don't work on it. I had to write a custom function to convert the Java array into a ColdFusion array.
However, net result: About a 90% performance savings over the old character-by-character method I'd been using before. Now if I could just find a way to speed up my bulk inserts...
Assuming you're doing bulk inserts into a SQL server database, I've been able to speed that process up by inserting 100 rows using one database call in the format below.
INSERT INTO memberTable(membername,memberLastName)
SELECT 'John','Smith'
UNION ALL
SELECT 'Julia','Roberts'
UNION ALL
SELECT 'Bruce','Wayne'
100 was an arbitrary number for me. At 1000, the inserts took longer than if I did them one at a time. But at 100, the process took significantly less time than if I inserted them one at a time.
I wasn't looking for the optimal number. So 100 shortened the time it took to an acceptable time for me.
Gabriel
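The batching idea above can be sketched as a small string-building helper. This is just an illustration under the assumptions of the example (table and column names from the snippet, a hypothetical `buildInsert` name); in real code you would want parameterized queries rather than concatenating values:

```java
import java.util.Arrays;
import java.util.List;

public class UnionAllBatch {

    // Hypothetical sketch: build one INSERT ... SELECT ... UNION ALL
    // statement for a batch of rows. Values are assumed pre-escaped;
    // production code should use parameterized queries instead.
    static String buildInsert(String table, List<String[]> rows) {
        StringBuilder sql = new StringBuilder(
            "INSERT INTO " + table + " (memberName, memberLastName)\n");
        for (int i = 0; i < rows.size(); i++) {
            if (i > 0) {
                sql.append("UNION ALL\n");
            }
            String[] r = rows.get(i);
            sql.append("SELECT '").append(r[0]).append("','")
               .append(r[1]).append("'\n");
        }
        return sql.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(
            new String[] { "John", "Smith" },
            new String[] { "Julia", "Roberts" });
        System.out.println(buildInsert("memberTable", rows));
    }
}
```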
Adam,
Yes, I also ran into Split() not returning a true ColdFusion array in this sort of algorithm. I just finished an algorithm for handling text based only on delimiters. My next step is to have CSV parsing that handles delimiters and line breaks. The tricky thing is that line breaks might be escaped... actually, now that I am writing this, it is occurring to me that these are EXACTLY the same problems. I can use the same string tokenizer to break it up by lines and then by delimiters in two passes.
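For reference, the qualifier rules from the standard can be handled in a single quote-aware pass over one record. This is only a minimal Java sketch, not the parser discussed above, and it assumes the record has already been split into one physical line (i.e., no embedded line breaks):

```java
import java.util.ArrayList;
import java.util.List;

public class CsvLineParser {

    // Minimal sketch of a single-pass parser for one CSV record that
    // honors double-quote qualifiers and doubled-quote escapes.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder field = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                if (c == '"') {
                    // A doubled quote inside a qualified field is an escape.
                    if (i + 1 < line.length() && line.charAt(i + 1) == '"') {
                        field.append('"');
                        i++;
                    } else {
                        inQuotes = false; // closing qualifier
                    }
                } else {
                    field.append(c);
                }
            } else if (c == '"') {
                inQuotes = true;          // opening qualifier
            } else if (c == ',') {
                fields.add(field.toString());
                field.setLength(0);
            } else {
                field.append(c);
            }
        }
        fields.add(field.toString());
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("a,\"Ben is the \"\"bomb\"\"\",c"));
        // [a, Ben is the "bomb", c]
    }
}
```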
Hello!
I came across an issue with cfx_text2Query and I wonder if anybody has an idea for my problem. It's about the character set, because I need to use German umlauts.
On my Windows machine, everything works fine importing the CSV with the umlauts (the file is ISO-8859-1, as Firefox tells me).
(config: win2k, CF7, and the head of the .cfm reads:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<cfprocessingdirective pageencoding="iso-8859-1">
<HTML>
<HEAD>
...
Now to the problem: when using the same script on the production machine (which runs under Linux) I get "?" instead of any umlaut. :(
Bloody linux?! Any ideas? :-)
Ronald
@Ronald,
I am always lacking experience with extended characters. Sorry :(
@Ronald
I had the same problem with German umlauts from a CSV import file. I set the charset attribute of the <cffile> tag to "iso-8859-1" and it worked.
<cffile action="read" file="/usr/local/httpd/htdocs/yoursite/umlauts.csv" variable="csvfile" charset="iso-8859-1">
<cfloop index="index" list="#csvfile#" delimiters="#chr(10)##chr(13)#">
<cfoutput>#listGetAt( index, 1, ";" )# #listGetAt( index, 2, ";" )# #listGetAt( index, 3, ";" )#</cfoutput>
<br />
</cfloop>
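For what it's worth, the Java-side equivalent of that charset attribute looks like this (a sketch using modern Java NIO, with the hypothetical file name from the snippet). Without an explicit charset, the platform default (often UTF-8 on Linux) mangles Latin-1 umlauts into replacement characters, which would explain the "?" output:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class UmlautSafeRead {
    public static void main(String[] args) throws IOException {
        // Read the CSV with an explicit ISO-8859-1 charset so that
        // Latin-1 bytes like 0xFC decode to the right characters
        // regardless of the platform default encoding.
        List<String> lines = Files.readAllLines(
            Paths.get("umlauts.csv"), StandardCharsets.ISO_8859_1);
        for (String line : lines) {
            // Semicolon-delimited, matching the listGetAt(..., ";") calls.
            System.out.println(line.split(";", -1)[0]);
        }
    }
}
```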