Parsing CSV Data Using ColdFusion
As part of my exploration of writing, reading, and creating Microsoft Excel documents using ColdFusion, I have come across the need to parse comma-separated-value (CSV) data files. While this seems at first like a relatively simple task, I soon found out that it was ANYTHING but simple. It's one thing to worry about lists (for which ColdFusion is the bomb-diggity), but it's an entirely other thing to worry about lists that have field qualifiers, escaped qualifiers, escaped qualifiers that might be empty fields, and non-qualified field values all rolled into one.
I tried looking it up in Google but could not find any satisfactory algorithms (translates to: code that I could understand). Everything on CSV seems to be in Java and none the stuff on CFLib.org seems to comply with the range of CSV values (especially qualified fields). So, in typical blood-and-guts fashion, I sat down and tried to write my own algorithm. This proved to be easy at first until I found out that my approach was highly flawed. I went through about three different implementations over the weekend of the algorithm before I came up with something that seemed to work satisfactorially.
It has to evaluate each character at a time, which probably won't scale or perform nicely. I would have liked to harness the power of CFHttp to convert CSV files to queries, but I could not get CFHttp to work on the LOCAL file system (ie. a URL that begins with "file:"). If anyone knows of great way to do this, please let me know. I suppose that I could written a temporary file to a public folder and then performed a CFHttp to it, then deleted it, but that just felt a bit "hacky." However, in the end that might just prove to be the way to go.
So anyway, this is what I have come up with. It is a function that takes either a chunk of CSV data or a file path to a CSV data file (text file) and converts it to an array of arrays. It assumes that each record is separated by a return character followed optionally by a new line. Not sure if that is cross system compliant, but heck, this is my first attempt:
<cffunction
name="CSVToArray"
access="public"
returntype="array"
output="false"
hint="Takes a delimited text data file or chunk of delimited data and converts it to an array of arrays.">
<!--- Define the arguments. --->
<cfargument
name="CSVData"
type="string"
required="false"
default=""
hint="This is the raw CSV data. This can be used if instead of a file path."
/>
<cfargument
name="CSVFilePath"
type="string"
required="false"
default=""
hint="This is the file path to a CSV data file. This can be used instead of a text data blob."
/>
<cfargument
name="Delimiter"
type="string"
required="false"
default=","
hint="The character that separate fields in the CSV."
/>
<cfargument
name="Qualifier"
type="string"
required="false"
default=""""
hint="The field qualifier used in conjunction with fields that have delimiters (not used as delimiters ex: 1,344,343.00 where [,] is the delimiter)."
/>
<!--- Define the local scope. --->
<cfset var LOCAL = StructNew() />
<!---
Check to see if we are dealing with a file. If we are,
then we will use the data from the file to overwrite
any csv data blob that was passed in.
--->
<cfif (
Len( ARGUMENTS.CSVFilePath ) AND
FileExists( ARGUMENTS.CSVFilePath )
)>
<!---
Read the data file directly into the arguments scope
where it can override the blod data.
--->
<cffile
action="READ"
file="#ARGUMENTS.CSVFilePath#"
variable="ARGUMENTS.CSVData"
/>
</cfif>
<!---
ASSERT: At this point, whether we got the CSV data
passed in as a data blob or we read it in from a
file on the server, we now have our raw CSV data in
the ARGUMENTS.CSVData variable.
--->
<!---
Make sure that we only have a one character delimiter.
I am not going traditional ColdFusion style here and
allowing multiple delimiters. I am trying to keep
it simple.
--->
<cfif NOT Len( ARGUMENTS.Delimiter )>
<!---
Since no delimiter was passed it, use thd default
delimiter which is the comma.
--->
<cfset ARGUMENTS.Delimiter = "," />
<cfelseif (Len( ARGUMENTS.Delimiter ) GT 1)>
<!---
Since multicharacter delimiter was passed, just
grab the first character as the true delimiter.
--->
<cfset ARGUMENTS.Delimiter = Left(
ARGUMENTS.Delimiter,
1
) />
</cfif>
<!---
Make sure that we only have a one character qualifier.
I am not going traditional ColdFusion style here and
allowing multiple qualifiers. I am trying to keep
it simple.
--->
<cfif NOT Len( ARGUMENTS.Qualifier )>
<!---
Since no qualifier was passed it, use thd default
qualifier which is the quote.
--->
<cfset ARGUMENTS.Qualifier = """" />
<cfelseif (Len( ARGUMENTS.Qualifier ) GT 1)>
<!---
Since multicharacter qualifier was passed, just
grab the first character as the true qualifier.
--->
<cfset ARGUMENTS.Qualifier = Left(
ARGUMENTS.Qualifier,
1
) />
</cfif>
<!--- Create an array to handel the rows of data. --->
<cfset LOCAL.Rows = ArrayNew( 1 ) />
<!---
Split the CSV data into rows of raw data. We are going
to assume that each row is delimited by a return and
/ or a new line character.
--->
<cfset LOCAL.RawRows = ARGUMENTS.CSVData.Split(
"\r\n?"
) />
<!--- Loop over the raw rows to parse out the data. --->
<cfloop
index="LOCAL.RowIndex"
from="1"
to="#ArrayLen( LOCAL.RawRows )#"
step="1">
<!--- Create a new array for this row of data. --->
<cfset ArrayAppend( LOCAL.Rows, ArrayNew( 1 ) ) />
<!--- Get the raw data for this row. --->
<cfset LOCAL.RowData = LOCAL.RawRows[ LOCAL.RowIndex ] />
<!---
Replace out the double qualifiers. Two qualifiers in
a row acts as a qualifier literal (OR an empty
field). Replace these with a single character to
make them easier to deal with. This is risky, but I
figure that Chr( 1000 ) is something that no one
is going to use (or is it????).
--->
<cfset LOCAL.RowData = LOCAL.RowData.ReplaceAll(
"[\#ARGUMENTS.Qualifier#]{2}",
Chr( 1000 )
) />
<!--- Create a new string buffer to hold the value. --->
<cfset LOCAL.Value = CreateObject(
"java",
"java.lang.StringBuffer"
).Init()
/>
<!---
Set an initial flag to determine if we are in the
middle of building a value that is contained within
quotes. This will alter the way we handle
delimiters - as delimiters or just character
literals.
--->
<cfset LOCAL.IsInField = false />
<!--- Loop over all the characters in this row. --->
<cfloop
index="LOCAL.CharIndex"
from="1"
to="#LOCAL.RowData.Length()#"
step="1">
<!---
Get the current character. Remember, since Java
is zero-based, we have to subtract one from out
index when getting the character at a
given position.
--->
<cfset LOCAL.ThisChar = LOCAL.RowData.CharAt(
JavaCast( "int", (LOCAL.CharIndex - 1))
) />
<!---
Check to see what character we are dealing with.
We are interested in special characters. If we
are not dealing with special characters, then we
just want to add the char data to the ongoing
value buffer.
--->
<cfif (LOCAL.ThisChar EQ ARGUMENTS.Delimiter)>
<!---
Check to see if we are in the middle of
building a value. If we are, then this is a
character literal, not an actual delimiter.
If we are NOT buildling a value, then this
denotes the end of a value.
--->
<cfif LOCAL.IsInField>
<!--- Append char to current value. --->
<cfset LOCAL.Value.Append(
LOCAL.ThisChar.ToString()
) />
<!---
Check to see if we are dealing with an
empty field. We will know this if the value
in the field is equal to our "escaped"
double field qualifier (see above).
--->
<cfelseif (
(LOCAL.Value.Length() EQ 1) AND
(LOCAL.Value.ToString() EQ Chr( 1000 ))
)>
<!---
We are dealing with an empty field so
just append an empty string directly to
this row data.
--->
<cfset ArrayAppend(
LOCAL.Rows[ LOCAL.RowIndex ],
""
) />
<!---
Start new value buffer for the next
row value.
--->
<cfset LOCAL.Value = CreateObject(
"java",
"java.lang.StringBuffer"
).Init()
/>
<cfelse>
<!---
Since we are not in the middle of
building a value, we have reached the
end of the field. Add the current value
to row array and start a new value.
Be careful that when we add the new
value, we replace out any "escaped"
qualifiers with an actual qualifier
character.
--->
<cfset ArrayAppend(
LOCAL.Rows[ LOCAL.RowIndex ],
LOCAL.Value.ToString().ReplaceAll(
"#Chr( 1000 )#{1}",
ARGUMENTS.Qualifier
)
) />
<!---
Start new value buffer for the next
row value.
--->
<cfset LOCAL.Value = CreateObject(
"java",
"java.lang.StringBuffer"
).Init()
/>
</cfif>
<!---
Check to see if we are dealing with a field
qualifier being used as a literal character.
We just have to be careful that this is NOT
an empty field (double qualifier).
--->
<cfelseif (LOCAL.ThisChar EQ ARGUMENTS.Qualifier)>
<!---
Toggle the field flag. This will signal that
future characters are part of a single value
despite and delimiters that might show up.
--->
<cfset LOCAL.IsInField = (NOT LOCAL.IsInField) />
<!---
We just have a non-special character. Add it
to the current value buffer.
--->
<cfelse>
<cfset LOCAL.Value.Append(
LOCAL.ThisChar.ToString()
) />
</cfif>
<!---
If we have no more characters left then we can't
ignore the current value. We need to add this
value to the row array.
--->
<cfif (LOCAL.CharIndex EQ LOCAL.RowData.Length())>
<!---
Check to see if the current value is equal
to the empty field. If so, then we just
want to add an empty string to the row.
--->
<cfif (
(LOCAL.Value.Length() EQ 1) AND
(LOCAL.Value.ToString() EQ Chr( 1000 ))
)>
<!---
We are dealing with an empty field.
Just add the empty string.
--->
<cfset ArrayAppend(
LOCAL.Rows[ LOCAL.RowIndex ],
""
) />
<cfelse>
<!---
Nothing special about the value. Just
add it to the row data.
--->
<cfset ArrayAppend(
LOCAL.Rows[ LOCAL.RowIndex ],
LOCAL.Value.ToString().ReplaceAll(
"#Chr( 1000 )#{1}",
ARGUMENTS.Qualifier
)
) />
</cfif>
</cfif>
</cfloop>
</cfloop>
<!--- Return the row data. --->
<cfreturn( LOCAL.Rows ) />
</cffunction>
I have chosen to convert the CSV to an array of arrays as I was not sure that you could depend on the constant number of fields per row. Plus, I figure that going from an array to a query (after this step) would be rather easy. Plus, since Excel is not perfectly square cols vs. rows, I figure this was more in-line with where I want to go with it (including it in my ColdFusion POI Utility component).
If I create a variable containing this CSV data:
last name,first name,salary,dream salary,happiness
Jones,Mike,"$35,500.00","$73,000.00"
Hopkins,Paul,"$55,234.00","$250,000.00",3.0
Hawkings,Katie,,,
,
Smith,Betty,"$57,010.00","$60,000.00",10.0
... and pass it into the CSVToArray ColdFusion user defined function:
<!--- Convert the CSV to an array of arrays. --->
<cfset arrCSV = CSVToArray(
CSVData = strCSVData,
Delimiter = ",",
Qualifier = """"
) />
<!--- Dump out array. --->
<cfdump var="#arrCSV#" label="CSV Data" />
I get this output:
As you can see, the CSVToArray() ColdFusion function handles mixed length records, empty field values, and qualified fields. It even handles escaped qualifiers (ex. "" becomes ") but this was not demonstrated. While this is not perfect, at least it provides me with a CSV conversion interface that I can use in my POI Utility ColdFusion component. Further down the road, I will be able to swap this out later for a better implementation.
Want to use code from this post? Check out the license.
Reader Comments
Ben,
I haven't thought through this, so forgive me if it's a stupid question, but
did you consider using regular expressions? If so, what caused you to decide against using them?
@Ben,
When dealing with lists, use the GetToken() function. It won't ignore empty list elements. This will significantly speed up your function and replace the loop that you are doing. Also Sammy hit the nails on the head with using RegEx to strip out the text between the qualifiers.
Another trick you can use to speed things up is to use GetToken() to populate the empty the empty cells and then use ListToArray() for the conversion. It's alot quicker then creating a Java Object on each call.
Hopefully this helps you out some.
@Sammy,
I did think of regular expressions, 'cause they are cool, but I wasn't sure how to apply them. Plus I don't think my skills with them would be good enough to handle all the different options that come with CSV formatting. Take for example:
ben,was,here
That is three fields. But this:
"ben,was,here"
is one field. But this:
""ben,was,here""
is three fields; the first starts with a quote literal, and the last field ends with a quote literal. And then this:
""ben,"was,here"""
has two fields.... you get the point? It was just too much for me to wrap my head around. I am sure that regular expressions would rock somehow, I just can't figure it out.
Tony,
It's funny you mention that because my first attempt actually did use a Tokenizer. In my experience, though, it does skip empty fields:
<cfset Tokenizer = CreateObject(
"java",
"java.util.StringTokenizer"
).Init(
JavaCast( "string", "a,b,,,,c,d,e,f" ),
JavaCast( "string", "," )
) />
<cfloop condition="Tokenizer.HasMoreTokens()">
[#Tokenizer.NextToken()#]<br />
</cfloop>
... outputs:
[a]
[b]
[c]
[d]
[e]
[f]
... it skips right over the empty fields. However, in my current implementation I do add a leading space to all fields which then gets stripped out later.
I did learn some things in iteration three that I didn't know in iteration one, so I could probably go back and apply that to the String Tokenizer. In fact, maybe I will do that.
Comma seperated is a good idea with cold fusion becoz it is gonna remove some of difficult queries and the irregularities. while is is easy to retrieve the information at the client end.
It is being used in www.compglobe.com where you are entitled to compose your comment and the comment will be transfered to the CSV file at the server level.
www.compglobe.com is also using CSV format to upload the phone no.s if you want to send information to the handset of the recipent to whom you want to delivered the material. www.compglobe.com has various things like message composer and an online radio too.
Doing something similar, i just grabbed http://opencsv.sourceforge.net/ and then did this:
<cfparam name="filename">
<cfscript>
fileReader = createobject("java","java.io.FileReader");
fileReader.init(filename);
csvReader = createObject("java","au.com.bytecode.opencsv.CSVReader");
csvReader.init(fileReader);
</cfscript>
<cfdump var="#csvReader.readAll()#">
Java and ColdFusion play SO nice together *smile*
Thanks for the code. This was very helpful since I'm just learning CF. I now from other experiences that parsing CSV files can be a real pain to get it to work right.
Thanks for the code and tutorial Ben - I was grappling with exactly the same issue relating to coverting CSV with encapsulating quotes and your post was a lifesaver!!
Always a pleasure to help out!
@Stephen:
Have you gotten this work with opencsv's CSVWriter?
Tim
This is similar perhaps to what I need to achieve.(I think)
My client has a list of products. (Product ID, Product Name, description) are the colum headers for the product table.
well, the description field data... is a CSV.
for example
the data in the description field is:
OD(+/-1.2mm), Wall Thickness = 5.0mm (+/- .4mm), Inside Diameter = 65.0mm, Approximate pieces per case = 4, Approximate weight per case = 32.34 lbs
But i need to take the data in that one field, and create more colums to display these attributes rather than this text blob.
Am I on the right track?
@JKS,
You can use CSV parsing to get those values; however, if those are the only values in the field, you can simply treat the data as if it were a comma-delimited list. Then, you can either split the list into an array with ListToArray(), or even use things like ListGetAt() and ListLen() to loop over the elements of the list and examine each individually.
Ben,
AWESOME JOB!!! I can't believe this was so difficult to find. You definitely saved HOURS of time and helped meet my deadline. This works great. People like you are what make the net an awesome place for research and learning. Thanks!!
@Rob,
Glad to help. Check out a more updated post on this type of thing:
www.bennadel.com/index.cfm?dax=blog:976.view
Hi Ben,
This is really great code that you are sharing. I am new to coldfusion coding.
I am not able to connect the dots between the array generated using this procedure and writing a query and or tying it into the POI Utility.
I imagine it is strait forward, but I am seeming not to be able to work it through.
Appreciate all your posts.
@Dwayne,
What exactly are you trying to accomplish?
Thanks man! It works great, you saved my day!
Great function Ben, saved me tons of work on a database upload!
@Phil, @Adrian,
Glad to help out fellas. If you like this, be sure to check out an updated version that uses regular expressions rather than character-based parsing.
www.bennadel.com/blog/991-CSVToArray-ColdFusion-UDF-For-Parsing-CSV-Data-Files.htm
It allows for more flexibility in the data in the CSV file.
hi ben. how would you go about creating a function that would read csv data no matter the order of columns - as long as the header fields are named to match my db fields.
i am using your wonderful code above and it works awesome! however, i need to adjust my array data each time i add columns (particularly if i rearrange the column order).
do i first read in the header row, and somehow do the matching there?
this may be extra for expert stuff!
@Mike,
I am typically not a fan of using header rows to do auto-name things (I don't usually trust clients to name things appropriately); but this seems to be something that people always are asking about. I will come up with something that makes this a bit easier to work with. I'll get back to you.
Ben,
Do you have any idea how to accommodate for a multilingual csv? Most (but not all) of the languages/characters pass through fine. It seems that Chinese and Russian are having the most trouble being interpreted. I'm guessing this is an issue with the charset, but I am not positive (nor am I sure as to how I would go about fixing this issue).
Thanks for all of the helpful postings!
Brian
^^Ben, I apologize about my above question. I realized that it was an error on MY end. I should have figured that :p Quick fix by manipulating cfcontent :)
Thanks for all of the helpful posts though!
@Brian,
No problem - I'm glad you got it figured out as I would have been stumped on what to suggest :P
Ben,
THANK YOU! (I'm not yelling, just excited). I've been using Coldfusion longer than I like to think... I never received any training... I just read Ben Forte's book... and I was off. That said, if I hadn't found your code I would have been forced to hack together some nasty bit of code that would have caused me more trouble than good.
Question:
I need to import the data from the array into a database. I know that I can loop through a List but not an array.... any good suggestions how to easily import from an array?
Thanks,
Wayne
Ben,
I tried using your CSV Java based code on a file that used TABS to separate the data but it didn't reliably recognize the tab delimiter.
Do I need to change the pattern that you created to have it work properly?
Thanks,
Wayne
@Wayne,
A few comments back, I actually pointed to a newer version of the CSV parsing function. It uses Regular Expressions to the parsing which turns out to be much faster and more flexible:
www.bennadel.com/blog/991-CSVToArray-ColdFusion-UDF-For-Parsing-CSV-Data-Files.htm
As far as moving them into a database, the UDF returns an array or arrays. To loop over them, you can use an index loop; or, if you are using CF8+, you can use an array loop.
While not directly related, I do have a post that talks about moving XML into a database. This does use a good bit on Array looping to get the job done:
www.bennadel.com/blog/1636-How-To-Move-XML-Data-Into-A-Database-Using-ColdFusion.htm
I hope some of that helps!
Ben,
I've been using your script for some time now but I'm still having trouble with any field that has a ". My file is delimited by TAB with no quotes. I have Qualifier set to "" (nothing) but the routine sees a " it ignores any more tabs in that record (concatenates all the rest of the fields for that one record into the field that had the " in it. Here are my parameters.
<cfset arrData = CSVToArray(
CSVData = strCSVTab,
Delimiter = "#chr(9)#",
Qualifier = ""
) />
Any help would be much appreciated.
Thanks,
Wayne Gregori
If you want an example of uploading a csv (using Ben's csvtoarray) and then looping through each element within an array within an array, here it is:
<!--- Assumes you already processed your form and have the file file... now get your csv file to your server --->
<cffile
action = "UPLOAD"
filefield = "myFile"
destination = "#ExpandPath('myfolder/')#"
nameconflict = "overwrite"
result = "thefile">
<cfset thepath = '#Expandpath('myfolder/')#' & '#thefile.ServerFile#'>
<cfinclude template="csvtoarray.cfm">
<cfset result=csvtoarray(#thepath#)>
<cfoutput>
<cfloop index="OuterLoop" from="1" to="#ArrayLen(Result)#">
<cfloop index="InnerLoop" from="1" to="#ArrayLen(Result[OuterLoop])#">
<cfoutput>
Result[#OuterLoop#][#InnerLoop#] is #Result[OuterLoop][InnerLoop]#<br>
</cfoutput>
</cfloop>
</cfloop>
</cfoutput>
If you want an example of uploading a csv (using Ben's csvtoarray) and then looping through each element within an array within an array, here it is:
<!--- Assumes you already processed your form and have the file file... now get your csv file to your server --->
<cffile
action = "UPLOAD"
filefield = "myFile"
destination = "#ExpandPath('myfolder/')#"
nameconflict = "overwrite"
result = "thefile">
<cfset thepath = '#Expandpath('myfolder/')#' & '#thefile.ServerFile#'>
<cfinclude template="csvtoarray.cfm">
<cfset result=csvtoarray(#thepath#)>
<cfoutput>
<cfloop index="OuterLoop" from="1" to="#ArrayLen(Result)#">
<cfloop index="InnerLoop" from="1" to="#ArrayLen(Result[OuterLoop])#">
<cfoutput>
Result[#OuterLoop#][#InnerLoop#] is #Result[OuterLoop][InnerLoop]#<br>
</cfoutput>
</cfloop>
</cfloop>
</cfoutput>
If you want an example of uploading a csv (using Ben's csvtoarray) and then looping through each element within an array within an array, here it is:
<!--- Assumes you already processed your form and have the file file... now get your csv file to your server --->
<cffile
action = "UPLOAD"
filefield = "myFile"
destination = "#ExpandPath('myfolder/')#"
nameconflict = "overwrite"
result = "thefile">
<cfset thepath = '#Expandpath('myfolder/')#' & '#thefile.ServerFile#'>
<cfinclude template="csvtoarray.cfm">
<cfset result=csvtoarray(#thepath#)>
<cfoutput>
<cfloop index="OuterLoop" from="1" to="#ArrayLen(Result)#">
<cfloop index="InnerLoop" from="1" to="#ArrayLen(Result[OuterLoop])#">
<cfoutput>
Result[#OuterLoop#][#InnerLoop#] is #Result[OuterLoop][InnerLoop]#<br>
</cfoutput>
</cfloop>
</cfloop>
</cfoutput>
Too complicated solutions. Please post simple ones.
I ran into an issue where there are line breaks in the middle of the qualified text. I thought there would be an easy way to ignore or remove those via regex before running this function but am struggling. Any ideals? Thanks!
Just a thought -
Surely that should be "\r?\n", since Windows newlines look like \r\n and Unix like \n. This way the regexp would accept both \r\n and \n, as opposed to \r\n and \r like the previous code did.
Thank you Ben. You always have the solutions to my problems. Great function.
Dina
It doesn't seem to matter which code example I try using from all the sources and examples provided from your blog post, but all the examples seem to through a coldfusion.runtime.Struct cannot be cast to java.lang.String
Got any suggestions of why? On a CF10 server, though it could be because of the code being built for a CF8 server...