Ask Ben: Parsing Very Large XML Documents In ColdFusion

Posted September 8, 2008 at 4:08 PM

Tags: ColdFusion, Ask Ben

Hello Ben, your website has come up numerous times in Google for my search to an answer that I cannot seem to find anywhere! You do however have related posts to my question - which is [drum roll]:

How do I read and parse large XML files in CF8!? I have multiple xml files of up to 135MB(!) each that I need to parse and INSERT into SQL. The problem appears to be XMLParse. I can read the XML file in via CFFILE no problems, however the XMLParse seems to max out the CF heap space (even after increasing it to 1024MB). From the reading I have done, it appears that because CF8 uses a DOM based approach, it must read in and parse the entire XML file into memory first - which is OK for small XML files, but absolutely kills the server on a 135MB file. People seem to suggest either:

1. Using SAX(?)
2. Changing the default XML parser within CF8 (which I fail to see how this would work as wouldn't it still need to read it into RAM?)

Anyway, I am hoping that you may already solved this in the past? Any help would be greatly appreciated!

This is perhaps the biggest problem with the way ColdFusion parses XML documents; it needs to able to load and parse the entire document in memory before it can return a result to you. I don't think there's anything that you can tweak in the settings to get around this - it's a property of the underlying Java library they use (Xerces I think).

Sure, you could use the SAX XML library, but then you have to start dealing with much more complicated parsing techniques. Plus, you know me - I like to build everything that I can in ColdFusion. It might not always be as fast as the pure Java solution, but when it comes to readability and maintainability, I don't think you can beat a single-technology solution.

So what can we do to get around the parsing limitation of large XML files in ColdFusion? To me, the most obvious solution is to rely on the fact that XML documents follow patterns; XML isn't just a random collection of data - it's structured data with extremely strict rules regarding formatting. And as always, when I think about patterns in text, I think of our very sexy friend, the regular expression.

What if, instead of parsing out an entire document as if it were XML, we looked for sub nodes of that document using XML patterns and then parsed those substrings into XML nodes. Sure, we wouldn't be able to pass around the XML document as a whole, but chances are, especially with extremely large documents, we don't need the information as a whole - we need it piece-wise anyway.

To develop and test our solution, first we need to create our massive XML document:

 Launch code in new window » Download code as text file »

  • <!--- Create a very large XML file. --->
  • <cfsavecontent variable="strXML">
  • <cfoutput>
  • <order>
  •  
  • <!--- Order properties. --->
  • <properties
  • date="September 8, 2008"
  • time="13:42"
  • vendor="Kinky Solutions"
  • />
  •  
  • <!--- Properties in order. --->
  • <products>
  •  
  • <!---
  • Loop over a large number of "products" to
  • create a long XML file.
  • --->
  • <cfloop
  • index="intI"
  • from="1"
  • to="10000"
  • step="1">
  •  
  • <product>
  • <sku>SKU#intI#</sku>
  • <name>Product #intI#</name>
  • <price>#RandRange( 1, 99 )#.99</price>
  • <quantity>#RandRange( 1, 5 )#</quantity>
  • </product>
  •  
  • </cfloop>
  •  
  • </products>
  •  
  • </order>
  • </cfoutput>
  • </cfsavecontent>
  •  
  •  
  • <!--- Write the XML data to the file. --->
  • <cffile
  • action="write"
  • file="#ExpandPath( './products.xml' )#"
  • output="#strXML#"
  • />

Here, we are creating an ORDER XML document that starts off with a properties node and is followed by 10,000 PRODUCT nodes. I don't even know if this scenario necessarily makes sense, but it creates a large document, and that's really all that I need.

To make our solution more usable, we are going to wrap it up in a ColdFusion component, SubNodeXmlParser.cfc. However, before we get into how that component works, let's take a look at how we will be using it. Remember, we can't parse the entire XML document at once, so we need to attack it a node at a time:

 Launch code in new window » Download code as text file »

  • <!---
  • Create the Sub XML node parser. We are going to have this
  • parser look for both the PROPERTIES and the PRODUCT nodes
  • (by passing in a comma delimited list of node names).
  • --->
  • <cfset objParser = CreateObject(
  • "component",
  • "SubNodeXmlParser"
  • ).Init(
  • "properties, product",
  • ExpandPath( "./products.xml" )
  • )
  • />
  •  
  •  
  • <!---
  • Output the names of all the nodes found. We need to use a
  • conditional loop since we don't know how many nodes there
  • will be.
  • --->
  • <cfloop condition="true">
  •  
  • <!--- Get the next node. --->
  • <cfset VARIABLES.Node = objParser.GetNextNode() />
  •  
  • <!---
  • Check to see if the node was found. If not, then the
  • variable, Node, will have been destroyed and will no
  • longer exist in its parent scope.
  • --->
  • <cfif StructKeyExistS( VARIABLES, "Node" )>
  •  
  • <!--- Output name of node. --->
  • #VARIABLES.Node.XmlName#<br />
  •  
  • <cfelse>
  •  
  • <!--- We are done finding nodes so break out. --->
  • <cfbreak />
  •  
  • </cfif>
  •  
  • </cfloop>

Notice first that our initialization method takes a comma delimited list of node names. This allows us skip over large parts of the XML document, concentrating purely on the nodes for which we have an interest. To get at these nodes, we use the GetNextNode() method. This will scan the XML file as a text document and look for the next XML node pattern. Finding it, it will parse it into a small ColdFusion XML document and return the XML node.

Running the above code, we get the following output:

properties
product
product
product
product
product
.... a few thousand more times ....

As you can see, it found the Properties node as well as all of the Product nodes. When running this code, we have to run in a Conditional loop since we have no idea how large the XML document will be. Essentially, we have to keep asking the parser for more data until it run out (and returns a VOID response).

So again, we do lose something with not being able to see the entire XML document in one view, but since you need to be inserting the data into a database, I am guessing that the piece-wise fashion will suite you just fine.

Ok, so now let's take a look at the ColdFusion component that makes this possible:

 Launch code in new window » Download code as text file »

  • <cfcomponent
  • output="false"
  • hint="I help to parse large XML files by matching patterns and then only parsing sub-nodes of the document.">
  •  
  •  
  • <cffunction
  • name="Init"
  • access="public"
  • returntype="any"
  • output="false"
  • hint="I return an intialized object.">
  •  
  • <!--- Define arguments. --->
  • <cfargument
  • name="Nodes"
  • type="string"
  • required="true"
  • hint="I am the list of node names that will be parsed using regular expressions."
  • />
  •  
  • <cfargument
  • name="XmlFilePath"
  • type="string"
  • required="true"
  • hint="I am the file path for the large XML file to be parsed."
  • />
  •  
  • <cfargument
  • name="BufferSize"
  • type="numeric"
  • required="false"
  • default="#(1024 * 1024 * 5)#"
  • hint="I am the size of the buffer which will be used to make reads to the input stream."
  • />
  •  
  • <!--- Define the local scope. --->
  • <cfset var LOCAL = {} />
  •  
  • <!---
  • Create the regular expression pattern based on the
  • node list. We have to match both standard nodes and
  • self-closing nodes. The first thing we have to do is
  • clean up the node list.
  • --->
  • <cfset LOCAL.Nodes = ListChangeDelims(
  • ARGUMENTS.Nodes,
  • "|",
  • ", "
  • ) />
  •  
  • <!--- Define the pattern. --->
  • <cfset LOCAL.Pattern = (
  • "(?i)" &
  • "<(#LOCAL.Nodes#)\b[^>]*(?<=/)>|" &
  • "<(#LOCAL.Nodes#)\b[^>]*>[\w\W]*?</\2>"
  • ) />
  •  
  • <!--- Set up the instance variables. --->
  • <cfset VARIABLES.Instance = {
  •  
  • <!---
  • This the compiled version of our regular
  • expression pattern. By compiling the pattern,
  • it allows us to access the Matcher functionality
  • later on.
  • --->
  • Pattern = CreateObject(
  • "java",
  • "java.util.regex.Pattern"
  • ).Compile(
  • JavaCast( "string", LOCAL.Pattern )
  • ),
  •  
  • <!---
  • This is the data buffer that will hold our
  • partial XML file data.
  • --->
  • DataBuffer = "",
  •  
  • <!---
  • The transfer buffer is what we will use to
  • transfer data from the input file stream into
  • our data buffer. It is this buffer that will
  • determine the size of each file read.
  • --->
  • TransferBuffer = RepeatString( " ", ARGUMENTS.BufferSize ).GetBytes(),
  •  
  • <!---
  • This will be our buffered file input stream
  • which let us read in the large XML file a
  • chunk at a time.
  • --->
  • InputStream = ""
  •  
  • } />
  •  
  • <!---
  • Setup the file intput stream. This buffere input
  • stream will all us to read in the XML file in
  • chunks rather than as a whole.
  • --->
  • <cfset VARIABLES.Instance.InputStream = CreateObject(
  • "java",
  • "java.io.BufferedInputStream"
  • ).Init(
  • CreateObject(
  • "java",
  • "java.io.FileInputStream"
  • ).Init(
  • JavaCast(
  • "string",
  • ARGUMENTS.XmlFilePath
  • )
  • )
  • )
  • />
  •  
  • <!--- Return an intialized object. --->
  • <cfreturn THIS />
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="Close"
  • access="public"
  • returntype="void"
  • output="false"
  • hint="This closes the input file stream. It is recommended that you call this if you finish before all nodes have been matched.">
  •  
  • <!--- Close the file input stream. --->
  • <cfset VARIABLES.Instance.InputStream.Close() />
  •  
  • <!--- Return out. --->
  • <cfreturn />
  • </cffunction>
  •  
  •  
  • <cffunction
  • name="GetNextNode"
  • access="public"
  • returntype="any"
  • output="false"
  • hint="I return the next node in the XML document. If no node can be found, I return VOID.">
  •  
  • <!--- Define the local scope. --->
  • <cfset var LOCAL = {} />
  •  
  • <!--- Create a matcher for our current buffer. --->
  • <cfset LOCAL.Matcher = VARIABLES.Instance.Pattern.Matcher(
  • JavaCast( "string", VARIABLES.Instance.DataBuffer )
  • ) />
  •  
  •  
  • <!--- Try to find the next node. --->
  • <cfif LOCAL.Matcher.Find()>
  •  
  • <!---
  • The matcher found a pattern match. Let's pull out
  • the matching XML.
  • --->
  • <cfset LOCAL.XMLData = LOCAL.Matcher.Group() />
  •  
  • <!---
  • Now that we have the pattern matched, we need to
  • figure out how many characters to leave in our
  • buffer.
  • --->
  • <cfset LOCAL.CharsToLeave = (
  • Len( VARIABLES.Instance.DataBuffer ) -
  • (LOCAL.Matcher.Start() + Len( LOCAL.XMLData ))
  • ) />
  •  
  • <!---
  • Check to see if we have any characters to leave
  • in the buffer after this match.
  • --->
  • <cfif LOCAL.CharsToLeave>
  •  
  • <!--- Trim the buffer. --->
  • <cfset VARIABLES.Instance.DataBuffer = Right(
  • VARIABLES.Instance.DataBuffer,
  • LOCAL.CharsToLeave
  • ) />
  •  
  • <cfelse>
  •  
  • <!---
  • No character data should be left in the
  • buffer. Just set it to empyt string.
  • --->
  • <cfset VARIABLES.Instance.DataBuffer = "" />
  •  
  • </cfif>
  •  
  • <!---
  • Now that we have the buffer updated, parse the
  • XML data and return the root element.
  • --->
  • <cfreturn
  • XmlParse( Trim( LOCAL.XMLData ) )
  • .XmlRoot
  • />
  •  
  • <cfelse>
  •  
  • <!---
  • The pattern matcher could not find the next node.
  • This might be because our buffer does contain
  • enough information. Let's try to read more of our
  • XML file into the buffer.
  • --->
  •  
  • <!--- Read input stream into local buffer. --->
  • <cfset LOCAL.BytesRead = VARIABLES.Instance.InputStream.Read(
  • VARIABLES.Instance.TransferBuffer,
  • JavaCast( "int", 0 ),
  • JavaCast( "int", ArrayLen( VARIABLES.Instance.TransferBuffer ) )
  • ) />
  •  
  • <!---
  • Check to see if we read any bytes. If we didn't
  • then we have run out of data to read and cannot
  • possibly match any more node patterns; just
  • return void.
  • --->
  • <cfif (LOCAL.BytesRead EQ -1)>
  •  
  • <!--- Release the file input stream. --->
  • <cfset THIS.Close() />
  •  
  • <!--- No more data to be matched. --->
  • <cfreturn />
  •  
  • <cfelse>
  •  
  • <!---
  • We have read data in from the buffered file
  • input stream. Now, let's append that to our
  • internal buffer. Be sure to only move over
  • the bytes that were read - this might not
  • include the whole buffer contents.
  • --->
  • <cfset VARIABLES.Instance.DataBuffer &= Left(
  • ToString( VARIABLES.Instance.TransferBuffer ),
  • LOCAL.BytesRead
  • ) />
  •  
  • </cfif>
  •  
  •  
  • <!---
  • Now that we have updated our buffer, we want to
  • give the pattern matcher another change to find
  • the node pattern.
  • --->
  • <cfreturn GetNextNode() />
  •  
  • </cfif>
  • </cffunction>
  •  
  • </cfcomponent>

The code for this is quite small and straightforward. The ColdFusion component basically opens up the file as a buffered input streams and makes repeated reads to the stream until it can match a node pattern. Once it matches the node pattern, it parses it out into an XML document and returns the root node (the target node). It then goes back to the buffered input stream for more data. When it has no more data to read and no more pattern matches to make, it simply returns VOID signaling the end of the search.

This solution may not be exactly what you were looking for, but at the least, I hope that it has given you some ideas.

Download Code Snippet ZIP File

Post Comment  |  Ask Ben  |  Other Searches  |  Print Page





Reader Comments

Sep 8, 2008 at 7:41 PM // reply »
8 Comments

Hi Ben,

Great example and tutorial, thanks!

I also just want to let you know that we had same kind of issue with CF7 for a client in D.C. area 2 years ago and we preferred SAX with Java libraries and as I remember it was not as complicated as expected.

Just my 2 cent. :)


Sep 8, 2008 at 10:35 PM // reply »
2 Comments

I have had success using Apache Digester (which comes with CF 7+) in CF to parse a 300MB+ file. Apache Digester is an easy to use SAX parser.


Sep 9, 2008 at 3:46 AM // reply »
64 Comments

Hi Ben
Question. You state "I like to build everything that I can in ColdFusion", but you use Java to do your regexes and file ops, both of which could have been done in CF, and more simply to boot. Any reason for that?

--
Adam


Sep 9, 2008 at 6:14 AM // reply »
4 Comments

re Adam:

I'm guessing that Coldfusion's string handling is particularly slow when it comes to large streams such as the one being discussed. Using the Java string handling expidites the process dramatically.

Or I could be talking nonsense :(


Sep 9, 2008 at 6:46 AM // reply »
4 Comments

Strewth; my spelling's bad.


Sep 9, 2008 at 8:34 AM // reply »
7,572 Comments

@Oguz, Kurt,

I will get around to trying those libraries one of these days. My biggest gripe, and this may be totally unfounded, is that I am not sure that they can work using ColdFusion listeners. Again, I am not talking from experience, but from what I have read, these types of libraries use the event listener model; and, since I believe it is quite difficult to invoke a ColdFusion component method from within a Java object, I assume that the listeners passed in have to be actual Java objects, not CFCs. If that is not the case, then it would make trying this much easier.

@Adam,

I am using a buffered input stream so that I don't have to read in the entire file into memory at one time. I guess I could have used a FileRead() action to do this, but frankly, I forgot that ColdFusion has that newer functionality.

As far as the regular expression parsing, using Java's Pattern and Matcher objects is actually much faster and easier to use than a pure CF solution. In ColdFusion, the regular expression find only returns the position and length of the match, and you have to manually keep looping over it with an explicit start value to get at all the matched patterns; using the Java regex utilities, looping over and getting access to all the patterns is extremely straightforward. I have to disagree with you from lots of experience that this would be easier in a pure CF solution.

I would say that using FileRead() might have been a bit easier, though.


Sep 9, 2008 at 6:36 PM // reply »
64 Comments

Yeah, it was the fileRead() thing instead of <cffile> I meant there.

In regards to the regex side of things, I realise the Java implementation offers much more power, but how you're using it here doesn't seem to be any different (except in a more convoluted way) than using a single reFind(). You're not doing multiple find() calls on the Matcher, so the benefit of having the Matcher keep track of how far down the string the find() is at is irrelevant.

For a lot of things, the Java implementation seems like a lot of unnecessary horsing around to me, but I suppose if one is using 'em all the time, it becomes second nature. I guess I need some practise!

Still, I converted your code to use native CF and the Java really is an awful lot faster (35sec to 200sec, averaged out!). Also I note that you're using a zero-width positive look-behind (*) in your regex, which CF won't accept. It seems strange that CF isn't just passing the regex straight to the underlying Java regex processor. I wonder why it sticks its beak in? Oh well.

All interesting stuff.

--
Adam

(*) I have no idea what one of those is: I just looked it up when CF errored. I was able to simplify the regex a bit so it worked with reFind()...


Sep 9, 2008 at 6:43 PM // reply »
7,572 Comments

@Adam,

Yeah, that's true - in this scenario, I am not really taking full advantage of the pattern matcher. However, as you point out, Java regular expressions are simply faster and I am using the positive look behind which.... (?<=/) simply means that the character "/" must exist just before this "point" in the pattern.

I agree that it does seem silly that ColdFusion doesn't just off the regular expression stuff to Java. Not sure why.

At the end of the day, this could have been done other ways, but I suppose I am so used to the Java pattern matcher that it just pops into my head as the first tool to try.


Jul 18, 2009 at 3:47 AM // reply »
2 Comments

I was able to parse approx 5,000,000 "rows/node" from a 13 GB XML file using <cfloop file="file.xml" index="currentLine"> XML parse and such </cfloop> in CF 8. My understanding is that this uses Java file streaming to get to the meat.

In this instance, the 100+ MB file sound simple. Java burps at about 5 million with an out of memory error, however. .NET has the same problem at about the same place so not sure if there is some funky line or the lack of memory management (always an issue in CF) control is creating the problem.


Jul 18, 2009 at 12:19 PM // reply »
7,572 Comments

@Crania,

Sounds like it might be a garbage collection problem. On requests that handle a large amount of information within a current request, even in small increments, I have found that ColdFusion has some trouble with garbage clean up. It seems to require a new page request to clean up some of the memory used up in the previous request. As such, I generally have to break mammoth tasks up across various page requests.


Jul 21, 2009 at 9:45 PM // reply »
2 Comments

Thanks for the reply.

I am still battling the file. I thought I would pit .Net against cf but both seem to quit at just over 5 million rows in.

Perhaps there is a utility out there which will split the file up based on newline chars.

Will post back here when I find a solution. A 1 mil line XML file worked great with the cfloop. It could be that line 5 million has a flaw which blow up heap size. :)


Jul 22, 2009 at 8:09 AM // reply »
7,572 Comments

@Crania,

Good luck! If you attack this in multiple pages requests, I am sure you can get something working.


Aug 1, 2009 at 1:55 PM // reply »
2 Comments

Given a Sql Server back end, seems like it would be simpler and faster to load the xml doc into Sql Server and then parse it out to an 'edge table' with openXml(). Sql Server can quickly parse very large xml docs, and edge tables, like flat files, are easy to work with.

Since you can easily import xml docs into Sql Server from the query pane, I assume that it can also be done through cfquery, so grabbing the xml doc & parsing it in Sql Server is managed through cfquery with a few lines of t-sql.


Aug 5, 2009 at 9:28 AM // reply »
7,572 Comments

@Bill,

That sounds pretty cool. I have not used SQL server to parse XML documents before. Thanks for the tip.


Aug 6, 2009 at 3:36 PM // reply »
1 Comments

Some great ideas. I actually tried to break the doc up initially and found something stupid - after row 5,000,000 (ish) the file was full of [spaces] - so the "line" that bot .NET and CF were blowing up on was one 5GB line of spaces. Once I split the file up and opened up some of the smaller files in a text editor it was easy to spot.

I'm quite happy with how the cfloop performed on a large file - was pretty skeptical, but it made it through about 5GB of data without a hiccup.


Oct 12, 2009 at 3:17 PM // reply »
1 Comments

Crania, Do you have any examples of this. The XML feed that I am being given does not contain any line breaks and so it does not seem to want to loop around it correctly.


Post Comment  |  Ask Ben

Recent Blog Comments
Mar 22, 2010 at 10:20 AM
POIUtility.cfc Examples For Reading And Writing Excel Files In ColdFusion
Can you please show me how to write to multiple sheets of a workbook? Thanks for your cool cfc. Khoa. ... read »
Lee
Mar 22, 2010 at 10:08 AM
Javascript's Implicit Boolean Conversions
I would certainly still use if(strValue.length > 0) over if(strValue) simply because I believe it makes the code more self-documenting. Not everyone knows that an empty string evaluates to false. ... read »
Mar 22, 2010 at 7:43 AM
Terms Of Service / Privacy Policy Document Generator
Thankyou for this very helpful form. You've made my life much easier today. I'll have a look around your site... I'm sure there's some more good stuff here..Thanks Dave ... read »
Mar 22, 2010 at 7:21 AM
Encountered "(. Incorrect Select Statement, Expecting a 'FROM', But Encountered '(' Instead, A Select Statement Should Have a 'FROM' Construct.
I got this exception now. In case you're using var-es local struct, CF gives you couple of "new" exceptions: Encountered "local. and Encountered "id. Incorrect Select List, Incorrect select colum ... read »
Mar 22, 2010 at 3:08 AM
Ask Ben: Selecting XML Attributes Given Other XML Attributes
Thanks for the response. I finally discovered that I was getting this error because I had cfsetting enablecfoutputonly="yes" in Application.cfc, and was neither setting it to false elsewhere nor brac ... read »
Mar 21, 2010 at 8:57 PM
The Bourne Ultimatum Starring Matt Damon And Julia Stiles
late to the party, but my observation is this: rewatch carefully for the platonic nature of the relationship between nicki and jason. she never flirts with him. he never comes on to her. they alway ... read »
Mar 21, 2010 at 7:40 PM
Is Simulating User-Input Events With jQuery Ever A Good Idea?
A couple of things. One you embed the initial state of of more-info in the CSS. IMHO, that behavior should be in jQuery: moreInfo.hide(); It shows that the behavior your toggling and closing is mor ... read »
Mar 21, 2010 at 3:59 PM
Exploring ColdFusion Component Runtime Class Properties And Serialization
@Elliott, according to Ben's experiment, serializeJSON() doesn't access the private data by default - it doesn't even access the getHair() method - so trying to clone a Girl.cfc via serializeJSON/des ... read »