Posted June 13, 2008 at
9:38 AM
Tags:
ColdFusion
Back around Regular Expression Day, RegEx guru Steve Levithan kicked off a little regular expression contest. That contest ends at the close of day today. I have been working on a little something for it, but unfortunately, with my work schedule (I also had a deadline today) I haven't really had time to finish it. It is partially working, however, and I will present what I have so far.
The idea here was to create a ColdFusion XML parser that worked in a similar fashion to the SAX XML parser. That is, that it would read in an XML document using an input stream and parse it piece-wise, rather than as one whole document. Now, before anyone jumps down my throat about this, it was just for FUN and an excuse to leverage regular expressions. This was never meant to be a replacement for SAX or even to be a viable product. This was merely something to experiment with and plant the seeds of new ideas. In the ways of James Dyson (inventor of the dual-cyclone vacuum cleaner), I wanted to work on something that I knew would fail thereby granting myself freedom from worrying.
That being said, the way this works is that there is an XML parser and an XML Listener. The XML Lister supports an interface that listens for events in the XML parser:
- StartDocument()
- StartNode()
- Text()
- EndNode()
- EndDocument()
As the XML Parser finds node events, it passes them off to the XML Listener. In my low-level version, the XML Parser has not understanding the structure of the XML document; it only knows how to interpret node types and parse them into individual XML elements. It is up to the XML Listener to take those XML nodes and do something logical with them.
Let's look at an example. Here is our sample XML document:
Launch code in new window » Download code as text file »
- <?xml version="1.0" encoding="utf-8" ?>
- <girls xmlns:kinky="http://www.kinkysolutions.com">
-
- <kinky:girl>
- <name kinky:id="1">Sarah Vivenzio</name>
- <desc>
- <![CDATA[
- Lorem ipsum dolor sit amet, consectetuer adipiscing
- elit. Nunc aliquet semper pede. Ut placerat leo vel
- ligula. Vestibulum posuere est sed tortor. Cras
- pretium. Vestibulum pulvinar arcu. Aliquam sed pede.
- Quisque elit. Sed scelerisque. Sed facilisis tortor
- a felis. Integer non ante sodales erat rhoncus
- laoreet. Lorem ipsum dolor sit amet, consectetuer
- adipiscing elit.
- ]]>
- </desc>
- </kinky:girl>
-
- </girls>
This isn't all that big, but imagine that this document could be thousands of megabytes or something and we couldn't read it all in at one time without killing the server's RAM. Notice also that it has a combination of namespaces, CDATA nodes, and regular text nodes (to simulate complexity).
Now, let's take a look at our listener, which currently is only listening for XML node events and outputting data - it is not doing anything logical with the data at this point:
Launch code in new window » Download code as text file »
- <cfcomponent
- output="false"
- hint="I listen for events from the xml parser.">
-
- <cffunction
- name="Init"
- access="public"
- returntype="any"
- output="false"
- hint="I initialize and return this object.">
-
- <cfreturn THIS />
- </cffunction>
-
-
- <cffunction
- name="StartDocument"
- access="public"
- returntype="void"
- output="true"
- hint="I execute when a StartDocument event was fired.">
-
- <p>
- Start Document
- </p>
-
- </cffunction>
-
-
- <cffunction
- name="StartNode"
- access="public"
- returntype="void"
- output="true"
- hint="I execute when a StartNode event was fired.">
-
- <p>
- Start Node:
- </p>
-
- <cfdump var="#ARGUMENTS[ 1 ]#" />
-
- </cffunction>
-
-
- <cffunction
- name="Text"
- access="public"
- returntype="void"
- output="true"
- hint="I execute when a TextNode event was fired.">
-
- <p>
- Text Node:
- </p>
-
- <cfdump var="#ARGUMENTS[ 1 ]#" />
-
- </cffunction>
-
-
- <cffunction
- name="EndNode"
- access="public"
- returntype="void"
- output="true"
- hint="I execute when an EndNode event was fired.">
-
- <p>
- End Node:
- </p>
-
- <cfdump var="#ARGUMENTS[ 1 ]#" />
-
- </cffunction>
-
-
- <cffunction
- name="EndDocument"
- access="public"
- returntype="void"
- output="true"
- hint="I execute when an EndDocument event was fired.">
-
- <p>
- End Document
- </p>
-
- </cffunction>
-
- </cfcomponent>
This XML Listener does nothing more than uphold the listener interface.
Now, let's see what happens when we use this (the code will be shown farther down) to listen for XML events:
Launch code in new window » Download code as text file »
- <cfset objListener = CreateObject(
- "component",
- "XmlListener"
- ).Init()
- />
-
- <cfset objParser = CreateObject(
- "component",
- "XmlParser"
- ).Init(
- objListener
- )
- />
-
- <cfset objParser.Parse( ExpandPath( "./data.xml" ) ) />
Hooking the ColdFusion components together is the easy part; we simply create the listener and then the parser, passing one CFC instance into the other. When we call the Parse() method on the XML parser, it starts to read in the XML data using a file input stream and passes node events off to the listener as they occur. When we run the above code, we get the following page output:
As you can see, each node event, whether it is an open, close, self-close, or text node, triggers an event in the XML Listener and the node data is passed through along with the event. As you can also see, the node data that is passed in as an actual XML node. I did this to make it easier for the XML Listener to work with.
Ok, now onto the fun stuff - the regular expression stuff! The way this works is that the XML parser knows how to match XML patterns using this verbose regular expression:
Launch code in new window » Download code as text file »
- (?xi)
-
- ## Make sure that we start off with a regular expression
- ## that is both verbose as well as case-insensitive.
-
- ## Start at the beginning of the string (this will be the
- ## beginning of our data buffer.
-
- ^
-
- ## Ignored by our event parser.
- ## Doc type / encoding property.
-
- (
- \s*
- <\?[\w\W]+?\?>
- \s*
- )|
-
- ## CDATA character match.
-
- (
- \s*
- <!\[CDATA\[
- [\w\W]*?
- \]\]>
- \s*
- )|
-
- ## Text node match.
-
- ([^<]+(?=<))|
-
- ## Self-closing tag match.
-
- (
- <[\w:\-]+
- (?:\s+[\w:\-]+\s*=\s*"[^"]*")*
- \s*
- />
- )|
-
- ## Open tag match.
-
- (
- <[\w:\-]+
- (?:\s+[\w:\-]+\s*=\s*"[^"]*")*
- \s*
- >
- )|
-
- ## Close tag match.
-
- (</[\w:\-]+>)
It starts reading in data from the file input stream. For each buffer read, it tries to match the above pattern. If nothing matches, it goes back to the file for more data to add to the internal buffer. Then it tires to match the above pattern again (from the beginning of the newly enlarged buffer). It keeps doing this repeated read until it finds a match or until it run out of data. When it does find a match, it checks to see which group in the regular expression pattern was actually captured. From the captured group index, it can determine which type of XML node was found - Start tag, end tag, CDATA, etc.. It then chops that match off of the internal buffer, parses it into an actual XML element, triggers an XML event, and then goes back to the internal buffer to start the matching process over.
Launch code in new window » Download code as text file »
- <cfcomponent
- output="false"
- hint="I parse an XML file a little bit at a time using a file input stream so that the whole file doesn't need to be read in and parsed at one time.">
-
-
- <cfset VARIABLES.Instance = {
-
- Listener = "",
-
- Buffer = "",
-
- Pattern = ""
- } />
-
-
- <cfsavecontent variable="VARIABLES.Instance.Pattern"
- >(?xi)
-
- ## Make sure that we start off with a regular expression
- ## that is both verbose as well as case-insensitive.
-
- ## Start at the beginning of the string (this will be the
- ## beginning of our data buffer.
-
- ^
-
- ## Ignored by our event parser.
- ## Doc type / encoding property.
-
- (
- \s*
- <\?[\w\W]+?\?>
- \s*
- )|
-
- ## CDATA character match.
-
- (
- \s*
- <!\[CDATA\[
- [\w\W]*?
- \]\]>
- \s*
- )|
-
- ## Text node match.
-
- ([^<]+(?=<))|
-
- ## Self-closing tag match.
-
- (
- <[\w:\-]+
- (?:\s+[\w:\-]+\s*=\s*"[^"]*")*
- \s*
- />
- )|
-
- ## Open tag match.
-
- (
- <[\w:\-]+
- (?:\s+[\w:\-]+\s*=\s*"[^"]*")*
- \s*
- >
- )|
-
- ## Close tag match.
-
- (</[\w:\-]+>)
-
- </cfsavecontent>
-
-
- <cfset VARIABLES.Instance.Pattern = CreateObject(
- "java",
- "java.util.regex.Pattern"
- ).Compile(
- JavaCast(
- "string",
- Trim( VARIABLES.Instance.Pattern )
- )
- )
- />
-
-
-
- <cffunction
- name="Init"
- access="public"
- returntype="any"
- output="false"
- hint="I initialize and return this object.">
-
- <cfargument
- name="Listener"
- type="any"
- required="true"
- hint="I am the XML parse listener."
- />
-
- <cfset VARIABLES.Instance.Listener = ARGUMENTS.Listener />
-
- <cfreturn THIS />
- </cffunction>
-
-
- <cffunction
- name="AnnounceEvent"
- access="private"
- returntype="void"
- output="true"
- hint="I take an event and announce it to the listener if it is set.">
-
- <cfargument
- name="EventName"
- type="string"
- required="true"
- hint="I am the name of the event to announce."
- />
-
- <cfargument
- name="EventData"
- type="any"
- required="false"
- default="#StructNew()#"
- hint="I am the event data that we are passing to the event listener."
- />
-
- <cfinvoke
- component="#VARIABLES.Instance.Listener#"
- method="#ARGUMENTS.EventName#">
-
- <cfinvokeargument
- name="1"
- value="#ARGUMENTS.EventData#"
- />
-
- </cfinvoke>
-
- <cfreturn />
- </cffunction>
-
-
- <cffunction
- name="GetNextNode"
- access="private"
- returntype="any"
- output="false"
- hint="I get the next whole element from the input stream. That might be a open tag, close tag, or node text.">
-
- <cfargument
- name="InputStream"
- type="any"
- required="true"
- output="false"
- hint="The buffered input stream for our XML file."
- />
-
- <cfset var LOCAL = {} />
-
- <cfset LOCAL.Matcher = VARIABLES.Instance.Pattern.Matcher(
- JavaCast( "string", VARIABLES.Instance.Buffer )
- ) />
-
- <cfif LOCAL.Matcher.Find()>
-
- <cfset LOCAL.IgnoredNode = LOCAL.Matcher.Group(
- JavaCast( "int", 1 )
- ) />
-
- <cfset LOCAL.CDATANode = LOCAL.Matcher.Group(
- JavaCast( "int", 2 )
- ) />
-
- <cfset LOCAL.TextNode = LOCAL.Matcher.Group(
- JavaCast( "int", 3 )
- ) />
-
- <cfset LOCAL.OpenCloseNode = LOCAL.Matcher.Group(
- JavaCast( "int", 4 )
- ) />
-
- <cfset LOCAL.OpenNode = LOCAL.Matcher.Group(
- JavaCast( "int", 5 )
- ) />
-
- <cfset LOCAL.CloseNode = LOCAL.Matcher.Group(
- JavaCast( "int", 6 )
- ) />
-
-
- <cfif StructKeyExists( LOCAL, "IgnoredNode" )>
-
- <cfset LOCAL.Node = {
- Type = "Ignored",
- Node = LOCAL.IgnoredNode
- } />
-
- <cfelseif StructKeyExists( LOCAL, "CDATANode" )>
-
- <cfset LOCAL.Node = {
- Type = "Text",
- Node = LOCAL.CDATANode
- } />
-
- <cfelseif StructKeyExists( LOCAL, "TextNode" )>
-
- <cfset LOCAL.Node = {
- Type = "Text",
- Node = LOCAL.TextNode
- } />
-
- <cfelseif StructKeyExists( LOCAL, "OpenCloseNode" )>
-
- <cfset LOCAL.Node = {
- Type = "SelfClose",
- Node = LOCAL.OpenCloseNode
- } />
-
- <cfelseif StructKeyExists( LOCAL, "OpenNode" )>
-
- <cfset LOCAL.Node = {
- Type = "Open",
- Node = LOCAL.OpenNode
- } />
-
- <cfelseif StructKeyExists( LOCAL, "CloseNode" )>
-
- <cfset LOCAL.Node = {
- Type = "Close",
- Node = LOCAL.CloseNode
- } />
-
- </cfif>
-
-
- <cfif (Len( VARIABLES.Instance.Buffer ) EQ Len( LOCAL.Node.Node ))>
-
- <cfset VARIABLES.Instance.Buffer = "" />
-
- <cfelse>
-
- <cfset VARIABLES.Instance.Buffer = Right(
- VARIABLES.Instance.Buffer,
- (Len( VARIABLES.Instance.Buffer ) - Len( LOCAL.Node.Node ))
- ) />
-
- </cfif>
-
-
- <cfreturn LOCAL.Node />
-
- <cfelse>
-
-
- <cfset LOCAL.Buffer = RepeatString( "12345", 1024 )
- .GetBytes()
- />
-
- <cfset LOCAL.BytesRead = ARGUMENTS.InputStream.Read(
- LOCAL.Buffer,
- JavaCast( "int", 0 ),
- JavaCast( "int", ArrayLen( LOCAL.Buffer ) )
- ) />
-
- <cfif (LOCAL.BytesRead EQ -1)>
-
- <cfreturn />
-
- <cfelse>
-
- <cfset VARIABLES.Instance.Buffer &= Left(
- ToString( LOCAL.Buffer ),
- LOCAL.BytesRead
- ) />
-
- </cfif>
-
-
- <cfreturn GetNextNode( ARGUMENTS.InputStream ) />
-
- </cfif>
- </cffunction>
-
-
- <cffunction
- name="Parse"
- access="public"
- returntype="void"
- output="true"
- hint="I parse the given XML file and announce events as I parse it.">
-
- <cfargument
- name="FilePath"
- type="string"
- required="true"
- hint="I am the expanded path to the given xml file."
- />
-
- <cfset var LOCAL = {} />
-
- <cfset LOCAL.InputStream = CreateObject(
- "java",
- "java.io.BufferedInputStream"
- ).Init(
- CreateObject(
- "java",
- "java.io.FileInputStream"
- ).Init(
- JavaCast( "string", ARGUMENTS.FilePath )
- )
- )
- />
-
-
- <cfset AnnounceEvent( "StartDocument" ) />
-
- <cfloop condition="true">
-
- <cfset LOCAL.Node = GetNextNode( LOCAL.InputStream ) />
-
- <cfif StructKeyExists( LOCAL, "Node" )>
-
- <cfif (LOCAL.Node.Type NEQ "Ignored")>
-
- <cfset LOCAL.ParsedNode = ParseNode(
- LOCAL.Node.Type,
- LOCAL.Node.Node
- ) />
-
- <cfswitch expression="#LOCAL.Node.Type#">
- <cfcase value="SelfClose">
-
- <cfset AnnounceEvent( "StartNode", LOCAL.ParsedNode ) />
- <cfset AnnounceEvent( "EndNode", LOCAL.ParsedNode ) />
-
- </cfcase>
- <cfcase value="Open">
-
- <cfset AnnounceEvent( "StartNode", LOCAL.ParsedNode ) />
-
- </cfcase>
- <cfcase value="Text">
-
- <cfset AnnounceEvent( "Text", LOCAL.ParsedNode ) />
-
- </cfcase>
- <cfcase value="Close">
-
- <cfset AnnounceEvent( "EndNode", LOCAL.ParsedNode ) />
-
- </cfcase>
- </cfswitch>
-
- </cfif>
-
- <cfelse>
-
- <cfbreak />
-
- </cfif>
-
- </cfloop>
-
- <cfset AnnounceEvent( "EndDocument" ) />
-
- <cfset LOCAL.InputStream.Close() />
-
- <cfreturn />
- </cffunction>
-
-
- <cffunction
- name="ParseNode"
- access="private"
- returntype="any"
- output="false"
- hint="I take an XML node string and parse it into an XML node.">
-
- <cfargument
- name="NodeType"
- type="string"
- required="true"
- hint="I am the type of node that was found."
- />
-
- <cfargument
- name="NodeString"
- type="string"
- required="true"
- hint="I am a string representation of part of an XML node."
- />
-
- <cfset var LOCAL = {} />
-
- <cfswitch expression="#ARGUMENTS.NodeType#">
- <cfcase value="SelfClose">
-
-
- </cfcase>
- <cfcase value="Open">
-
- <cfset ARGUMENTS.NodeString = REReplace(
- ARGUMENTS.NodeString,
- ">$",
- "/>",
- "one"
- ) />
-
- </cfcase>
- <cfcase value="Text">
-
-
- </cfcase>
- <cfcase value="Close">
-
- <cfset ARGUMENTS.NodeString = REReplace(
- ARGUMENTS.NodeString,
- "^</([^>]+)>$",
- "<\1/>",
- "one"
- ) />
-
- </cfcase>
- </cfswitch>
-
-
- <cfset LOCAL.NameSpaces = REMatch(
- "^<[\w\d\-]+:[\w\d\-]+|[\w\d\-]+:[\w\d\-]+\s*=",
- ARGUMENTS.NodeString
- ) />
-
- <cfset LOCAL.NameSpaceString = "" />
-
- <cfloop
- index="LOCAL.NameSpace"
- array="#LOCAL.NameSpaces#">
-
- <cfset LOCAL.NameSpace = ListFirst(
- Replace( LOCAL.NameSpace, "<", "", "one" ),
- ":"
- ) />
-
- <cfif (LOCAL.NameSpace NEQ "xmlns")>
-
- <cfset LOCAL.NameSpaceString &= (
- " xmlns:" &
- LOCAL.NameSpace &
- "=""http://www.domain.ext"""
- ) />
-
- </cfif>
-
- </cfloop>
-
-
- <cfxml variable="LOCAL.Xml">
- <cfoutput>
- <root #LOCAL.NameSpaceString#>
- <data>#ARGUMENTS.NodeString#</data>
- <data2>place-holder</data2>
- </root>
- </cfoutput>
- </cfxml>
-
-
- <cfif (ARGUMENTS.NodeType EQ "Text")>
-
- <cfset LOCAL.ChildNodes = XmlSearch(
- LOCAL.Xml,
- "/root/data2/text()"
- ) />
-
- <cfset LOCAL.ChildNodes[ 1 ].XmlValue = LOCAL.Xml.root.data.XmlText />
-
- <cfelse>
-
- <cfset LOCAL.ChildNodes = XmlSearch(
- LOCAL.Xml,
- "/root/data/*[ 1 ]/"
- ) />
-
- </cfif>
-
-
- <cfreturn LOCAL.ChildNodes[ 1 ] />
- </cffunction>
-
- </cfcomponent>
This was a fun little piece of functionality to hack away at. It has a ton of short comings, such as not being able to handle comments or invalid XML data, but, like I said, this was really just a though experiment and little more than an excuse to use some regular expressions.
Download Code Snippet ZIP File
Comments (3) |
Post Comment |
Ask Ben |
Permalink |
Other Searches |
Print Page
What Other People Are Searching For
[ local search ]
coldfusion xml parsing
[ local search ]
parse part of an xml document in coldfusion
Nice. Wow, you put a lot of work into that... more than I was expecting from contestants. Parsing XML with regular expressions is something I've wanted to take a stab at for a long time since it sounds like a lot of fun (even though there are established implementations and possibly better approaches out there). Again, really nice work.
As noted, my regex contest ( http://blog.stevenlevithan.com/archives/regexday-2008 ) ends today, so if anyone else wants to get in on the action, you'd better hurry. Entries can be as simple as a single regular expression that matches something interesting or creative.
Posted by Steven Levithan
on Jun 13, 2008
at 10:48 AM
Thanks Steve. This was fun. Aside from the fact that it can't handle a lot of things, it was much less complicated than I thought it would be.
Posted by Ben Nadel
on Jun 13, 2008
at 11:21 AM
I have written a RegExp based XML parser in PHP5. It was meant to parse a few XML nodes in large and complex XML streams, given an XPath-like selector (very simplified).
It had - and still has, the engine still being used in this old application I wrote - some (major ?) limitations and performance issues :
- No tolerance for invalid XML structure. Unknown XML entities where accepted if not in the selected node, cause they were not parsed until after the selection of the XML string code of the adequate node.
- Due to the engine "forward" behavior, the entire stream could be parsed in the worst case scenarii (repetitive structure and selected node in the end of the stream, "almost flat" XML, huge nested nodes, etc). Since I did not parse the entire structure, I found no way to sort/hash the nodes in order to fasten the search.
- The overall performance could be (much) poorer on slow (old) servers with some added memory, compared to the widely used XML parsers available in PHP5 (the very intense use of the proc by the PCRE engine did much more harm than the huge memory consumption of the common XML parsers).
Note : I do not know of this SAX XML parser thing ... If easily callable from a PHP script, it could be a good replacement.
Note 2 : Please excuse the wierd english sentences, if any, since I'm french.
Posted by RIVAUD Matthieu
on Jul 23, 2008
at 1:18 PM
Post Comment |
Ask Ben