Hello Ben, your website has come up numerous times in Google in my search for an answer that I cannot seem to find anywhere! You do, however, have posts related to my question - which is [drum roll]:
How do I read and parse large XML files in CF8!? I have multiple xml files of up to 135MB(!) each that I need to parse and INSERT into SQL. The problem appears to be XMLParse. I can read the XML file in via CFFILE no problems, however the XMLParse seems to max out the CF heap space (even after increasing it to 1024MB). From the reading I have done, it appears that because CF8 uses a DOM based approach, it must read in and parse the entire XML file into memory first - which is OK for small XML files, but absolutely kills the server on a 135MB file. People seem to suggest either:
1. Using SAX(?)
2. Changing the default XML parser within CF8 (I fail to see how this would work, as wouldn't it still need to read the file into RAM?)
Anyway, I am hoping that you may have already solved this in the past? Any help would be greatly appreciated!
This is perhaps the biggest problem with the way ColdFusion parses XML documents; it needs to be able to load and parse the entire document in memory before it can return a result to you. I don't think there's anything that you can tweak in the settings to get around this - it's a property of the underlying Java library they use (Xerces, I think).
Sure, you could use the SAX XML library, but then you have to start dealing with much more complicated parsing techniques. Plus, you know me - I like to build everything that I can in ColdFusion. It might not always be as fast as the pure Java solution, but when it comes to readability and maintainability, I don't think you can beat a single-technology solution.
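For a sense of what the event-driven SAX style involves, here is a minimal Java sketch. It is not code from this post: the class name ProductNameHandler and the "name" child element are assumptions made purely for illustration. The parser pushes start, text, and end events at a handler, and the handler has to keep track of where it is in the document.

```java
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that collects the text of every <name> element. */
class ProductNameHandler extends DefaultHandler {
    final List<String> names = new ArrayList<>();
    private final StringBuilder text = new StringBuilder();
    private boolean inName = false;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attrs) {
        if (qName.equals("name")) {
            inName = true;
            text.setLength(0); // start collecting text for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Text may arrive in several pieces, so append rather than assign.
        if (inName) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (qName.equals("name")) {
            names.add(text.toString());
            inName = false;
        }
    }

    /** Runs the SAX parser over the source; only the collected names are kept in memory. */
    static List<String> readNames(InputSource source) {
        try {
            ProductNameHandler handler = new ProductNameHandler();
            SAXParserFactory.newInstance().newSAXParser().parse(source, handler);
            return handler.names;
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }
}
```

The document is never held in memory as a whole, but the state tracking (the inName flag here) grows with every element you care about, which is the extra complexity mentioned above.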
So what can we do to get around the parsing limitation of large XML files in ColdFusion? To me, the most obvious solution is to rely on the fact that XML documents follow patterns; XML isn't just a random collection of data - it's structured data with extremely strict rules regarding formatting. And as always, when I think about patterns in text, I think of our very sexy friend, the regular expression.
What if, instead of parsing out an entire document as if it were XML, we looked for sub nodes of that document using XML patterns and then parsed those substrings into XML nodes? Sure, we wouldn't be able to pass around the XML document as a whole, but chances are, especially with extremely large documents, we don't need the information as a whole - we need it piece-wise anyway.
To develop and test our solution, first we need to create our massive XML document:
Here, we are creating an ORDER XML document that starts off with a properties node and is followed by 10,000 PRODUCT nodes. I don't even know if this scenario necessarily makes sense, but it creates a large document, and that's really all that I need.
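As a rough illustration of building such a test document, here is a small Java sketch that streams the nodes out one at a time instead of assembling the whole string in memory. The class name OrderXmlGenerator and the child elements (id, name) are assumptions for the example; only the order / properties / product shape comes from the description above.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.io.Writer;

/** Hypothetical generator for a large order document. */
class OrderXmlGenerator {

    /** Writes one properties node followed by productCount product nodes. */
    static void write(Writer out, int productCount) {
        try {
            out.write("<order>\n");
            out.write("  <properties><id>1</id></properties>\n");
            for (int i = 1; i <= productCount; i++) {
                // Child elements are made up; any payload works for testing.
                out.write("  <product><id>" + i + "</id><name>Product " + i + "</name></product>\n");
            }
            out.write("</order>\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing a buffered file writer and a count of 10,000 would produce a document of the shape described here.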
To make our solution more usable, we are going to wrap it up in a ColdFusion component, SubNodeXmlParser.cfc. However, before we get into how that component works, let's take a look at how we will be using it. Remember, we can't parse the entire XML document at once, so we need to attack it a node at a time:
Notice first that our initialization method takes a comma delimited list of node names. This allows us to skip over large parts of the XML document, concentrating purely on the nodes in which we have an interest. To get at these nodes, we use the GetNextNode() method. This will scan the XML file as a text document and look for the next XML node pattern. Finding it, it will parse it into a small ColdFusion XML document and return the XML node.
Running the above code, we get the following output:
properties
product
product
product
product
product
.... a few thousand more times ....
As you can see, it found the Properties node as well as all of the Product nodes. When running this code, we have to run in a Conditional loop since we have no idea how large the XML document will be. Essentially, we have to keep asking the parser for more data until it runs out (and returns a VOID response).
So again, we do lose something with not being able to see the entire XML document in one view, but since you need to be inserting the data into a database, I am guessing that the piece-wise fashion will suit you just fine.
Ok, so now let's take a look at the ColdFusion component that makes this possible:
The code for this is quite small and straightforward. The ColdFusion component basically opens up the file as a buffered input stream and makes repeated reads to the stream until it can match a node pattern. Once it matches the node pattern, it parses it out into an XML document and returns the root node (the target node). It then goes back to the buffered input stream for more data. When it has no more data to read and no more pattern matches to make, it simply returns VOID, signaling the end of the search.
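The read, match, parse cycle described above can be sketched in plain Java roughly as follows. This is not the SubNodeXmlParser.cfc component itself: the class name SubNodeScanner, its methods, and the regular expression are assumptions for illustration, and it returns null where the component returns VOID. It also assumes that target nodes are never nested inside a node of the same name, are not self-closing, and have no ">" inside attribute values.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Hypothetical sketch: pulls target nodes out of a character stream one at a time. */
class SubNodeScanner {
    private final BufferedReader reader;
    private final Pattern nodePattern;
    private final StringBuilder buffer = new StringBuilder();
    private final char[] chunk = new char[8192];
    private boolean endOfInput = false;

    SubNodeScanner(Reader source, List<String> nodeNames) {
        this.reader = new BufferedReader(source);
        // Build an alternation such as (properties|product), quoting each name.
        StringBuilder names = new StringBuilder();
        for (String name : nodeNames) {
            if (names.length() > 0) {
                names.append('|');
            }
            names.append(Pattern.quote(name));
        }
        // <name ...> up to the first matching </name>; DOTALL lets "." span lines.
        this.nodePattern = Pattern.compile(
            "<(" + names + ")(?:\\s[^>]*)?>.*?</\\1\\s*>", Pattern.DOTALL);
    }

    /** Returns the text of the next target node, or null once the input is exhausted. */
    String nextNode() {
        try {
            while (true) {
                Matcher matcher = nodePattern.matcher(buffer);
                if (matcher.find()) {
                    String node = matcher.group();
                    // Discard everything up to the end of the match.
                    buffer.delete(0, matcher.end());
                    return node;
                }
                if (endOfInput) {
                    return null;
                }
                // No complete node buffered yet, so read another chunk.
                int read = reader.read(chunk);
                if (read == -1) {
                    endOfInput = true;
                } else {
                    buffer.append(chunk, 0, read);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Parses one extracted node into a small DOM document and returns its root element. */
    static Element parse(String nodeXml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(nodeXml)))
                .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse node", e);
        }
    }
}
```

Calling nextNode() in a loop until it returns null mirrors the conditional loop used earlier. One caveat of this simple version: text between target nodes stays in the buffer until the next match is found, so long stretches of non-matching content still cost memory.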
This solution may not be exactly what you were looking for, but at the least, I hope that it has given you some ideas.
Hi Ben,
Great example and tutorial, thanks!
I also just want to let you know that we had the same kind of issue with CF7 for a client in the D.C. area 2 years ago. We chose SAX with Java libraries and, as I remember, it was not as complicated as expected.
Just my 2 cents. :)
Posted by Oğuz Demirkapı on Sep 8, 2008 at 7:41 PM
I have had success using Apache Digester (which comes with CF 7+) in CF to parse a 300MB+ file. Apache Digester is an easy to use SAX parser.
Posted by Kurt Wiersma on Sep 8, 2008 at 10:35 PM
Hi Ben
Question. You state "I like to build everything that I can in ColdFusion", but you use Java to do your regexes and file ops, both of which could have been done in CF, and more simply to boot. Any reason for that?
--
Adam
Posted by Adam Cameron on Sep 9, 2008 at 3:46 AM
re Adam:
I'm guessing that Coldfusion's string handling is particularly slow when it comes to large streams such as the one being discussed. Using the Java string handling expidites the process dramatically.
Or I could be talking nonsense :(
Posted by Paul McCombie on Sep 9, 2008 at 6:14 AM
Strewth; my spelling's bad.
Posted by Paul McCombie on Sep 9, 2008 at 6:46 AM
@Oguz, Kurt,
I will get around to trying those libraries one of these days. My biggest gripe, and this may be totally unfounded, is that I am not sure that they can work using ColdFusion listeners. Again, I am not talking from experience, but from what I have read, these types of libraries use the event listener model; and, since I believe it is quite difficult to invoke a ColdFusion component method from within a Java object, I assume that the listeners passed in have to be actual Java objects, not CFCs. If that is not the case, then it would make trying this much easier.
@Adam,
I am using a buffered input stream so that I don't have to read in the entire file into memory at one time. I guess I could have used a FileRead() action to do this, but frankly, I forgot that ColdFusion has that newer functionality.
As far as the regular expression parsing, using Java's Pattern and Matcher objects is actually much faster and easier to use than a pure CF solution. In ColdFusion, the regular expression find only returns the position and length of the match, and you have to manually keep looping over it with an explicit start value to get at all the matched patterns; using the Java regex utilities, looping over and getting access to all the patterns is extremely straightforward. I have to disagree with you from lots of experience that this would be easier in a pure CF solution.
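To make the comparison concrete, here is a small Java sketch of collecting every match with Pattern and Matcher. The class and method names are hypothetical, chosen just for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that gathers all matches of a pattern. */
class MatchAll {
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        // Each find() resumes where the previous match ended,
        // so there is no start offset to track by hand.
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }
}
```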
I would say that using FileRead() might have been a bit easier, though.
Posted by Ben Nadel on Sep 9, 2008 at 8:34 AM
Yeah, it was the fileRead() thing instead of <cffile> I meant there.
In regards to the regex side of things, I realise the Java implementation offers much more power, but how you're using it here doesn't seem to be any different (except in a more convoluted way) than using a single reFind(). You're not doing multiple find() calls on the Matcher, so the benefit of having the Matcher keep track of how far down the string the find() is at is irrelevant.
For a lot of things, the Java implementation seems like a lot of unnecessary horsing around to me, but I suppose if one is using 'em all the time, it becomes second nature. I guess I need some practise!
Still, I converted your code to use native CF and the Java really is an awful lot faster (35sec to 200sec, averaged out!). Also I note that you're using a zero-width positive look-behind (*) in your regex, which CF won't accept. It seems strange that CF isn't just passing the regex straight to the underlying Java regex processor. I wonder why it sticks its beak in? Oh well.
All interesting stuff.
--
Adam
(*) I have no idea what one of those is: I just looked it up when CF errored. I was able to simplify the regex a bit so it worked with reFind()...
Posted by Adam Cameron on Sep 9, 2008 at 6:36 PM
@Adam,
Yeah, that's true - in this scenario, I am not really taking full advantage of the pattern matcher. However, as you point out, Java regular expressions are simply faster and I am using the positive look behind which.... (?<=/) simply means that the character "/" must exist just before this "point" in the pattern.
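A tiny Java sketch of that look-behind, using a made-up pattern rather than the one from the component: "product" only matches when a "/" sits immediately before it, and the slash itself is not part of the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical demo of a zero-width positive look-behind. */
class LookBehindDemo {
    /** Returns the index of the first "product" preceded by "/", or -1 if there is none. */
    static int firstClosingName(String input) {
        Matcher matcher = Pattern.compile("(?<=/)product").matcher(input);
        return matcher.find() ? matcher.start() : -1;
    }
}
```

For the input <product></product>, the opening tag is skipped and the match starts at the name inside the closing tag.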
I agree that it does seem silly that ColdFusion doesn't just hand off the regular expression stuff to Java. Not sure why.
At the end of the day, this could have been done other ways, but I suppose I am so used to the Java pattern matcher that it just pops into my head as the first tool to try.
Posted by Ben Nadel on Sep 9, 2008 at 6:43 PM