Partial Entry For Steve Levithan's Regular Expression Contest
Back around Regular Expression Day, RegEx guru Steve Levithan kicked off a little regular expression contest. That contest ends at the close of day today. I have been working on a little something for it, but unfortunately, with my work schedule (I also had a deadline today) I haven't really had time to finish it. It is partially working, however, and I will present what I have so far.
The idea here was to create a ColdFusion XML parser that worked in a similar fashion to the SAX XML parser. That is, that it would read in an XML document using an input stream and parse it piece-wise, rather than as one whole document. Now, before anyone jumps down my throat about this, it was just for FUN and an excuse to leverage regular expressions. This was never meant to be a replacement for SAX or even to be a viable product. This was merely something to experiment with and plant the seeds of new ideas. In the ways of James Dyson (inventor of the dual-cyclone vacuum cleaner), I wanted to work on something that I knew would fail thereby granting myself freedom from worrying.
That being said, the way this works is that there is an XML parser and an XML Listener. The XML Lister supports an interface that listens for events in the XML parser:
- StartDocument()
- StartNode()
- Text()
- EndNode()
- EndDocument()
As the XML Parser finds node events, it passes them off to the XML Listener. In my low-level version, the XML Parser has not understanding the structure of the XML document; it only knows how to interpret node types and parse them into individual XML elements. It is up to the XML Listener to take those XML nodes and do something logical with them.
Let's look at an example. Here is our sample XML document:
<?xml version="1.0" encoding="utf-8" ?>
<friends xmlns:kinky="http://www.kinkysolutions.com">
<kinky:friend>
<name kinky:id="1">Sarah Vivenzio</name>
<desc>
<![CDATA[
Lorem ipsum dolor sit amet, consectetuer adipiscing
elit. Nunc aliquet semper pede. Ut placerat leo vel
ligula. Vestibulum posuere est sed tortor. Cras
pretium. Vestibulum pulvinar arcu. Aliquam sed pede.
Quisque elit. Sed scelerisque. Sed facilisis tortor
a felis. Integer non ante sodales erat rhoncus
laoreet. Lorem ipsum dolor sit amet, consectetuer
adipiscing elit.
]]>
</desc>
</kinky:friend>
</friends>
This isn't all that big, but imagine that this document could be thousands of megabytes or something and we couldn't read it all in at one time without killing the server's RAM. Notice also that it has a combination of namespaces, CDATA nodes, and regular text nodes (to simulate complexity).
Now, let's take a look at our listener, which currently is only listening for XML node events and outputting data - it is not doing anything logical with the data at this point:
<cfcomponent
output="false"
hint="I listen for events from the xml parser.">
<cffunction
name="Init"
access="public"
returntype="any"
output="false"
hint="I initialize and return this object.">
<!--- Return This reference. --->
<cfreturn THIS />
</cffunction>
<cffunction
name="StartDocument"
access="public"
returntype="void"
output="true"
hint="I execute when a StartDocument event was fired.">
<p>
Start Document
</p>
</cffunction>
<cffunction
name="StartNode"
access="public"
returntype="void"
output="true"
hint="I execute when a StartNode event was fired.">
<p>
Start Node:
</p>
<cfdump var="#ARGUMENTS[ 1 ]#" />
</cffunction>
<cffunction
name="Text"
access="public"
returntype="void"
output="true"
hint="I execute when a TextNode event was fired.">
<p>
Text Node:
</p>
<cfdump var="#ARGUMENTS[ 1 ]#" />
</cffunction>
<cffunction
name="EndNode"
access="public"
returntype="void"
output="true"
hint="I execute when an EndNode event was fired.">
<p>
End Node:
</p>
<cfdump var="#ARGUMENTS[ 1 ]#" />
</cffunction>
<cffunction
name="EndDocument"
access="public"
returntype="void"
output="true"
hint="I execute when an EndDocument event was fired.">
<p>
End Document
</p>
</cffunction>
</cfcomponent>
This XML Listener does nothing more than uphold the listener interface.
Now, let's see what happens when we use this (the code will be shown farther down) to listen for XML events:
<!--- Create a listener for our XML parsing. --->
<cfset objListener = CreateObject(
"component",
"XmlListener"
).Init()
/>
<!---
Create a parser and pass in our listener. This parser
will trigger events in the listener as it finds nodes
in the XML document.
--->
<cfset objParser = CreateObject(
"component",
"XmlParser"
).Init(
objListener
)
/>
<!--- Parse the XML document. --->
<cfset objParser.Parse( ExpandPath( "./data.xml" ) ) />
Hooking the ColdFusion components together is the easy part; we simply create the listener and then the parser, passing one CFC instance into the other. When we call the Parse() method on the XML parser, it starts to read in the XML data using a file input stream and passes node events off to the listener as they occur. When we run the above code, we get the following page output:
As you can see, each node event, whether it is an open, close, self-close, or text node, triggers an event in the XML Listener and the node data is passed through along with the event. As you can also see, the node data that is passed in as an actual XML node. I did this to make it easier for the XML Listener to work with.
Ok, now onto the fun stuff - the regular expression stuff! The way this works is that the XML parser knows how to match XML patterns using this verbose regular expression:
(?xi)
## Make sure that we start off with a regular expression
## that is both verbose as well as case-insensitive.
## Start at the beginning of the string (this will be the
## beginning of our data buffer.
^
## Ignored by our event parser.
## Doc type / encoding property.
(
\s*
<\?[\w\W]+?\?>
\s*
)|
## CDATA character match.
(
\s*
<!\[CDATA\[
[\w\W]*?
\]\]>
\s*
)|
## Text node match.
([^<]+(?=<))|
## Self-closing tag match.
(
<[\w:\-]+
(?:\s+[\w:\-]+\s*=\s*"[^"]*")*
\s*
/>
)|
## Open tag match.
(
<[\w:\-]+
(?:\s+[\w:\-]+\s*=\s*"[^"]*")*
\s*
>
)|
## Close tag match.
(</[\w:\-]+>)
It starts reading in data from the file input stream. For each buffer read, it tries to match the above pattern. If nothing matches, it goes back to the file for more data to add to the internal buffer. Then it tires to match the above pattern again (from the beginning of the newly enlarged buffer). It keeps doing this repeated read until it finds a match or until it run out of data. When it does find a match, it checks to see which group in the regular expression pattern was actually captured. From the captured group index, it can determine which type of XML node was found - Start tag, end tag, CDATA, etc.. It then chops that match off of the internal buffer, parses it into an actual XML element, triggers an XML event, and then goes back to the internal buffer to start the matching process over.
<cfcomponent
output="false"
hint="I parse an XML file a little bit at a time using a file input stream so that the whole file doesn't need to be read in and parsed at one time.">
<!--- Set up instance variables. --->
<cfset VARIABLES.Instance = {
<!--- This is the the event listern. --->
Listener = "",
<!---
This is the buffer that holds are data that is
pulled in from the input buffer but has not yet
been processed.
--->
Buffer = "",
<!---
The compiled pattern that we will be using to parse
the data that comes from the input stream.
--->
Pattern = ""
} />
<!---
Create a regular expression pattern to handle the
differnt node types. We are doing it here so we can
create an eaiser to read, verbose pattern.
--->
<cfsavecontent variable="VARIABLES.Instance.Pattern"
>(?xi)
## Make sure that we start off with a regular expression
## that is both verbose as well as case-insensitive.
## Start at the beginning of the string (this will be the
## beginning of our data buffer.
^
## Ignored by our event parser.
## Doc type / encoding property.
(
\s*
<\?[\w\W]+?\?>
\s*
)|
## CDATA character match.
(
\s*
<!\[CDATA\[
[\w\W]*?
\]\]>
\s*
)|
## Text node match.
([^<]+(?=<))|
## Self-closing tag match.
(
<[\w:\-]+
(?:\s+[\w:\-]+\s*=\s*"[^"]*")*
\s*
/>
)|
## Open tag match.
(
<[\w:\-]+
(?:\s+[\w:\-]+\s*=\s*"[^"]*")*
\s*
>
)|
## Close tag match.
(</[\w:\-]+>)
</cfsavecontent>
<!--- Now that we have the pattern, compile it. --->
<cfset VARIABLES.Instance.Pattern = CreateObject(
"java",
"java.util.regex.Pattern"
).Compile(
JavaCast(
"string",
Trim( VARIABLES.Instance.Pattern )
)
)
/>
<cffunction
name="Init"
access="public"
returntype="any"
output="false"
hint="I initialize and return this object.">
<!--- Define arguments. --->
<cfargument
name="Listener"
type="any"
required="true"
hint="I am the XML parse listener."
/>
<!--- Set listener. --->
<cfset VARIABLES.Instance.Listener = ARGUMENTS.Listener />
<!--- Return This reference. --->
<cfreturn THIS />
</cffunction>
<cffunction
name="AnnounceEvent"
access="private"
returntype="void"
output="true"
hint="I take an event and announce it to the listener if it is set.">
<!--- Define arguments. --->
<cfargument
name="EventName"
type="string"
required="true"
hint="I am the name of the event to announce."
/>
<cfargument
name="EventData"
type="any"
required="false"
default="#StructNew()#"
hint="I am the event data that we are passing to the event listener."
/>
<!--- Invoke event listener in the listener object. --->
<cfinvoke
component="#VARIABLES.Instance.Listener#"
method="#ARGUMENTS.EventName#">
<cfinvokeargument
name="1"
value="#ARGUMENTS.EventData#"
/>
</cfinvoke>
<!--- Return out. --->
<cfreturn />
</cffunction>
<cffunction
name="GetNextNode"
access="private"
returntype="any"
output="false"
hint="I get the next whole element from the input stream. That might be a open tag, close tag, or node text.">
<!--- Define arguments. --->
<cfargument
name="InputStream"
type="any"
required="true"
output="false"
hint="The buffered input stream for our XML file."
/>
<!--- Define the local scope. --->
<cfset var LOCAL = {} />
<!--- Create a pattern matcher on the internal buffer. --->
<cfset LOCAL.Matcher = VARIABLES.Instance.Pattern.Matcher(
JavaCast( "string", VARIABLES.Instance.Buffer )
) />
<!--- Check to see if we can find a node. --->
<cfif LOCAL.Matcher.Find()>
<!---
We found a node pattern. Let's check to see which
one we found. To do this, we are going to get
each group. Only one of them should be NOT NULL.
--->
<cfset LOCAL.IgnoredNode = LOCAL.Matcher.Group(
JavaCast( "int", 1 )
) />
<cfset LOCAL.CDATANode = LOCAL.Matcher.Group(
JavaCast( "int", 2 )
) />
<cfset LOCAL.TextNode = LOCAL.Matcher.Group(
JavaCast( "int", 3 )
) />
<cfset LOCAL.OpenCloseNode = LOCAL.Matcher.Group(
JavaCast( "int", 4 )
) />
<cfset LOCAL.OpenNode = LOCAL.Matcher.Group(
JavaCast( "int", 5 )
) />
<cfset LOCAL.CloseNode = LOCAL.Matcher.Group(
JavaCast( "int", 6 )
) />
<!---
Now that we have gotten our pattern groups out of
the match, let's check to see what node type we
are dealing with. Based on the node type, we can
return the proper group.
--->
<cfif StructKeyExists( LOCAL, "IgnoredNode" )>
<cfset LOCAL.Node = {
Type = "Ignored",
Node = LOCAL.IgnoredNode
} />
<cfelseif StructKeyExists( LOCAL, "CDATANode" )>
<cfset LOCAL.Node = {
Type = "Text",
Node = LOCAL.CDATANode
} />
<cfelseif StructKeyExists( LOCAL, "TextNode" )>
<cfset LOCAL.Node = {
Type = "Text",
Node = LOCAL.TextNode
} />
<cfelseif StructKeyExists( LOCAL, "OpenCloseNode" )>
<cfset LOCAL.Node = {
Type = "SelfClose",
Node = LOCAL.OpenCloseNode
} />
<cfelseif StructKeyExists( LOCAL, "OpenNode" )>
<cfset LOCAL.Node = {
Type = "Open",
Node = LOCAL.OpenNode
} />
<cfelseif StructKeyExists( LOCAL, "CloseNode" )>
<cfset LOCAL.Node = {
Type = "Close",
Node = LOCAL.CloseNode
} />
</cfif>
<!---
Before we return the node, trim the internal
buffer so we don't keep matching on the same
node pattern.
--->
<cfif (Len( VARIABLES.Instance.Buffer ) EQ Len( LOCAL.Node.Node ))>
<!--- Just reset the internal buffer. --->
<cfset VARIABLES.Instance.Buffer = "" />
<cfelse>
<!--- Trim the local buffer. --->
<cfset VARIABLES.Instance.Buffer = Right(
VARIABLES.Instance.Buffer,
(Len( VARIABLES.Instance.Buffer ) - Len( LOCAL.Node.Node ))
) />
</cfif>
<!--- Return the node. --->
<cfreturn LOCAL.Node />
<cfelse>
<!---
We did not find a node pattern. It is possible
that we need to get more data out of the input
stream before we can get that data. Therefore,
let's read more data into out local buffer.
--->
<!---
Create a buffer into which we will read the
buffered input stream. Let's make the buffer
about five megs. We are going to use a large
string and then get the underlying bytes to get
a byte buffer.
--->
<cfset LOCAL.Buffer = RepeatString( "12345", 1024 )
.GetBytes()
/>
<!--- Read input stream into local buffer. --->
<cfset LOCAL.BytesRead = ARGUMENTS.InputStream.Read(
LOCAL.Buffer,
JavaCast( "int", 0 ),
JavaCast( "int", ArrayLen( LOCAL.Buffer ) )
) />
<!---
Check to see if we read any bytes. If we didn't
then we have run out of data to read and cannot
possibly match any more node patterns; just
return void.
--->
<cfif (LOCAL.BytesRead EQ -1)>
<!---
No more data. Return VOID to signal that
we have run out of data.
--->
<cfreturn />
<cfelse>
<!---
We have read data in from the buffered input
stream. Now, let's append that to our
internal buffer. Be sure to only move over
the bytes that were read - this might not
include the whole buffer contents.
--->
<cfset VARIABLES.Instance.Buffer &= Left(
ToString( LOCAL.Buffer ),
LOCAL.BytesRead
) />
</cfif>
<!---
Now that we have updated our buffer, call the
method again to find the next node pattern.
--->
<cfreturn GetNextNode( ARGUMENTS.InputStream ) />
</cfif>
</cffunction>
<cffunction
name="Parse"
access="public"
returntype="void"
output="true"
hint="I parse the given XML file and announce events as I parse it.">
<!--- Define arguments. --->
<cfargument
name="FilePath"
type="string"
required="true"
hint="I am the expanded path to the given xml file."
/>
<!--- Define the local scope. --->
<cfset var LOCAL = {} />
<!---
Create a file input stream to read in the XML file
a bit at a time. We are going to create a buffered
input stream for efficiency.
--->
<cfset LOCAL.InputStream = CreateObject(
"java",
"java.io.BufferedInputStream"
).Init(
CreateObject(
"java",
"java.io.FileInputStream"
).Init(
JavaCast( "string", ARGUMENTS.FilePath )
)
)
/>
<!--- Announce the start document event. --->
<cfset AnnounceEvent( "StartDocument" ) />
<!---
Keep looping over the next reads until we get to a
point where we no longer have data (GetNextNode()
returns void).
--->
<cfloop condition="true">
<!--- Read the next node. --->
<cfset LOCAL.Node = GetNextNode( LOCAL.InputStream ) />
<!--- Check to see if we have a noce. --->
<cfif StructKeyExists( LOCAL, "Node" )>
<!--- Check to see if we are ignoring this node. --->
<cfif (LOCAL.Node.Type NEQ "Ignored")>
<!--- Parse the node into an actual structure. --->
<cfset LOCAL.ParsedNode = ParseNode(
LOCAL.Node.Type,
LOCAL.Node.Node
) />
<!---
Check to see which type of node was returned.
We will use thie node type (which is a pseudo
node type) to announce an event.
--->
<cfswitch expression="#LOCAL.Node.Type#">
<cfcase value="SelfClose">
<cfset AnnounceEvent( "StartNode", LOCAL.ParsedNode ) />
<cfset AnnounceEvent( "EndNode", LOCAL.ParsedNode ) />
</cfcase>
<cfcase value="Open">
<cfset AnnounceEvent( "StartNode", LOCAL.ParsedNode ) />
</cfcase>
<cfcase value="Text">
<cfset AnnounceEvent( "Text", LOCAL.ParsedNode ) />
</cfcase>
<cfcase value="Close">
<cfset AnnounceEvent( "EndNode", LOCAL.ParsedNode ) />
</cfcase>
</cfswitch>
</cfif>
<cfelse>
<!---
GetNextNode() returned void which means that
it has run out of data to read. Break out of
our conditional loop.
--->
<cfbreak />
</cfif>
</cfloop>
<!--- Announce the end document event. --->
<cfset AnnounceEvent( "EndDocument" ) />
<!---
Close the input stream. This will release the
file from the buffered reader and will free it
from locking.
--->
<cfset LOCAL.InputStream.Close() />
<!--- Return out. --->
<cfreturn />
</cffunction>
<cffunction
name="ParseNode"
access="private"
returntype="any"
output="false"
hint="I take an XML node string and parse it into an XML node.">
<!--- Define arguments. --->
<cfargument
name="NodeType"
type="string"
required="true"
hint="I am the type of node that was found."
/>
<cfargument
name="NodeString"
type="string"
required="true"
hint="I am a string representation of part of an XML node."
/>
<!--- Define the local scope. --->
<cfset var LOCAL = {} />
<!---
When we parse this node, we might have to kludge
the node text a bit in order to get it parsed by
CFXML. This include turning openNode types into
self closing tags.
--->
<cfswitch expression="#ARGUMENTS.NodeType#">
<cfcase value="SelfClose">
<!--- Leave as is. --->
</cfcase>
<cfcase value="Open">
<!--- Turn this node into a self-closing node. --->
<cfset ARGUMENTS.NodeString = REReplace(
ARGUMENTS.NodeString,
">$",
"/>",
"one"
) />
</cfcase>
<cfcase value="Text">
<!--- Leave as is. --->
</cfcase>
<cfcase value="Close">
<!--- Turn into self-closing node. --->
<cfset ARGUMENTS.NodeString = REReplace(
ARGUMENTS.NodeString,
"^</([^>]+)>$",
"<\1/>",
"one"
) />
</cfcase>
</cfswitch>
<!---
We have to be careful for nodes that have namespaces.
If we try to parse a node that uses a name space and
that namespace is not bound, we will get an error.
Therefore, we have to check to see if our node usses
any namespaces.
--->
<cfset LOCAL.NameSpaces = REMatch(
"^<[\w\d\-]+:[\w\d\-]+|[\w\d\-]+:[\w\d\-]+\s*=",
ARGUMENTS.NodeString
) />
<!---
Set a default name space string. This is the string
that will be added to the root node of our future
CFXML document.
--->
<cfset LOCAL.NameSpaceString = "" />
<!--- Loop over any name space matches. --->
<cfloop
index="LOCAL.NameSpace"
array="#LOCAL.NameSpaces#">
<!--- Clean the namespace. --->
<cfset LOCAL.NameSpace = ListFirst(
Replace( LOCAL.NameSpace, "<", "", "one" ),
":"
) />
<!---
Make sure we are not trying to bind the xmlns
prefix itself. We only want to find ones NOT in
the root document.
--->
<cfif (LOCAL.NameSpace NEQ "xmlns")>
<!--- Add this name space to our string. --->
<cfset LOCAL.NameSpaceString &= (
" xmlns:" &
LOCAL.NameSpace &
"=""http://www.domain.ext"""
) />
</cfif>
</cfloop>
<!---
At this point, we have converted our node string
into an XML node that could be properly parsed as
part of a new document. We have also extracted any
name spaces that need to be bound in this document.
Create a simple XML document with the above nested
element. We are using a place holder data2 to help
work around some CDATA limitations in CF XML Parsing.
--->
<cfxml variable="LOCAL.Xml">
<cfoutput>
<root #LOCAL.NameSpaceString#>
<data>#ARGUMENTS.NodeString#</data>
<data2>place-holder</data2>
</root>
</cfoutput>
</cfxml>
<!---
Get the child node. When we do this, we need to
treat element nodes differently than text nodes.
--->
<cfif (ARGUMENTS.NodeType EQ "Text")>
<!--- Get node text from dummy node. --->
<cfset LOCAL.ChildNodes = XmlSearch(
LOCAL.Xml,
"/root/data2/text()"
) />
<!---
Store actual value of text node into our
dummy node value. This will allow us to get
the actual text even though CDATA text is not
handled so well in ColdFusion.
--->
<cfset LOCAL.ChildNodes[ 1 ].XmlValue = LOCAL.Xml.root.data.XmlText />
<cfelse>
<!--- Get node. --->
<cfset LOCAL.ChildNodes = XmlSearch(
LOCAL.Xml,
"/root/data/*[ 1 ]/"
) />
</cfif>
<!--- Return the parsed node. --->
<cfreturn LOCAL.ChildNodes[ 1 ] />
</cffunction>
</cfcomponent>
This was a fun little piece of functionality to hack away at. It has a ton of short comings, such as not being able to handle comments or invalid XML data, but, like I said, this was really just a though experiment and little more than an excuse to use some regular expressions.
Want to use code from this post? Check out the license.
Reader Comments
Nice. Wow, you put a lot of work into that... more than I was expecting from contestants. Parsing XML with regular expressions is something I've wanted to take a stab at for a long time since it sounds like a lot of fun (even though there are established implementations and possibly better approaches out there). Again, really nice work.
As noted, my regex contest ( http://blog.stevenlevithan.com/archives/regexday-2008 ) ends today, so if anyone else wants to get in on the action, you'd better hurry. Entries can be as simple as a single regular expression that matches something interesting or creative.
Thanks Steve. This was fun. Aside from the fact that it can't handle a lot of things, it was much less complicated than I thought it would be.
I have written a RegExp based XML parser in PHP5. It was meant to parse a few XML nodes in large and complex XML streams, given an XPath-like selector (very simplified).
It had - and still has, the engine still being used in this old application I wrote - some (major ?) limitations and performance issues :
- No tolerance for invalid XML structure. Unknown XML entities where accepted if not in the selected node, cause they were not parsed until after the selection of the XML string code of the adequate node.
- Due to the engine "forward" behavior, the entire stream could be parsed in the worst case scenarii (repetitive structure and selected node in the end of the stream, "almost flat" XML, huge nested nodes, etc). Since I did not parse the entire structure, I found no way to sort/hash the nodes in order to fasten the search.
- The overall performance could be (much) poorer on slow (old) servers with some added memory, compared to the widely used XML parsers available in PHP5 (the very intense use of the proc by the PCRE engine did much more harm than the huge memory consumption of the common XML parsers).
Note : I do not know of this SAX XML parser thing ... If easily callable from a PHP script, it could be a good replacement.
Note 2 : Please excuse the wierd english sentences, if any, since I'm french.