Using The Regular Expression Boundary Match \G To Find The End Of The Previous Match
A couple of years ago, Steven Levithan - Regular Expression ninja and co-author of the Regular Expressions Cookbook - helped me with a regular expression pattern for parsing comma-separated values (CSV). In the pattern, he made use of a RegEx boundary character that I had not seen before: \G. This boundary character matches the end of the previous match, preventing the parsing engine from skipping characters between matches. While I mostly understood his explanation, I thought it was time to look into this RegEx construct myself.
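To make the "no skipping" idea concrete, here is a minimal sketch in plain Java. The class name, method names, and sample input are all hypothetical and are not taken from Steven's CSV pattern. It pulls comma-separated numbers out of a string twice: once with a pattern that may start a match anywhere, and once with the same pattern anchored by \G.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ContiguousMatchSketch {

    // Collect capture group 1 from every successive find() call.
    static List<String> collect(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        List<String> values = new ArrayList<>();
        while (matcher.find()) {
            values.add(matcher.group(1));
        }
        return values;
    }

    // Without \G, find() is free to skip past text that does not match.
    static List<String> anywhere(String input) {
        return collect("(\\d+)(?:,|$)", input);
    }

    // With \G, each match must begin exactly where the previous one ended
    // (or at the start of the input for the first match).
    static List<String> contiguous(String input) {
        return collect("\\G(\\d+)(?:,|$)", input);
    }

    public static void main(String[] args) {
        String input = "10,20,x30,40";
        System.out.println(anywhere(input));   // [10, 20, 30, 40]
        System.out.println(contiguous(input)); // [10, 20]
    }
}
```

The unanchored version quietly steps over the stray "x", while the \G version stops at the first character it cannot account for.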
At its core, ColdFusion provides access to two different regular expression engines: POSIX and Java. The POSIX regular expression engine is what powers the built-in RegEx functions like reFind(), reMatch(), and reReplace(). The Java regular expression engine is made available to us at the Java layer via createObject( "java", "java.util.regex.Pattern" ).
These two engines have a lot in common, but each also has features that the other lacks. The \G boundary match appears to be one of those things that is more or less unique to the Java engine. While the POSIX engine does have some basic support for \G, I have found it to be buggy and, in some cases, capable of exhausting the Java heap and throwing an out-of-memory error.
To see the \G regular expression boundary match in action, I wanted to use it in a common use case - parsing the missingMethodName argument passed into the onMissingMethod() ColdFusion event handler. I will try this using both the POSIX and the Java regular expression engines:
<!--- Mimic a missingMethodName argument. --->
<cfset missingMethodName = "getPhoneNumber" />
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<cfsavecontent variable="patternText">(?x)
## The \G here tells the pattern matching engine to ensure
## that the current match starts at the end of the last
## match - no characters can be skipped between matches.
##
## The very first match must be at the beginning of the string
## (which is at position zero).
\G
(
## Greater string starts with "get" or "set".
^( get | set )
|
## Greater string ends with word characters.
\w+$
)
</cfsavecontent>
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<!---
First, we're going to try using this pattern with the POSIX
regular expression engine - the one that powers core ColdFusion
functions like reFind() and reMatch().
--->
<!--- Parse the missing method name. --->
<cfset nameParts = reMatch( patternText, missingMethodName ) />
<!--- Output matches. --->
<cfoutput>
POSIX: [ #arrayToList( nameParts, " ] [ " )# ]
</cfoutput>
<br />
<!--- ----------------------------------------------------- --->
<!--- ----------------------------------------------------- --->
<br />
<!---
Next, we're going to try the Java regular expression engine
using the Pattern class.
--->
<!--- Get a matcher for the compiled pattern. --->
<cfset matcher = createObject( "java", "java.util.regex.Pattern" )
.compile( javaCast( "string", patternText ) )
.matcher( javaCast( "string", missingMethodName ) )
/>
<cfoutput>
JAVA:
<!--- Find each match in the target string. --->
<cfloop condition="matcher.find()">
[ #matcher.group()# ]
</cfloop>
</cfoutput>
In this demo, I am using a verbose regular expression (as enabled by the (?x) mode flag). This mode allows us to add white space and comments to the pattern for clarity and readability. As you can see, my \G boundary is the very first part of my pattern. This ensures that each match starts at the end of the previous match (or at the beginning of the string - position zero - for the first match). The meat of the pattern then matches either the get/set action or (|) the component property being accessed (e.g. PhoneNumber).
When I run this code through the two different regular expression engines, I get the following output:
POSIX: [ get ]
JAVA: [ get ] [ PhoneNumber ]
As you can see, the POSIX engine (reMatch() in this case) didn't quite like the use of \G: it returned the get action and then stopped, never picking up the property name. The Java regular expression engine, on the other hand, had no problem with it at all.
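For anyone who wants to try the Java side outside of ColdFusion, here is a rough standalone port of the demo. The class name and the second, hyphenated input are my own additions for illustration. The pattern is the same as above, with Java's "#" comment syntax in place of the escaped "##".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class MethodNameSketch {

    // (?x) turns on comments mode: white space in the pattern is ignored
    // and "#" starts a comment that runs to the end of the line.
    private static final Pattern NAME_PARTS = Pattern.compile(
        "(?x)                                                 \n" +
        "\\G              # begin where the last match ended  \n" +
        "(                                                    \n" +
        "    ^(get|set)   # leading action                    \n" +
        "    |                                                \n" +
        "    \\w+$        # trailing property name            \n" +
        ")"
    );

    // Return every successive match found in the method name.
    static List<String> parse(String methodName) {
        Matcher matcher = NAME_PARTS.matcher(methodName);
        List<String> parts = new ArrayList<>();
        while (matcher.find()) {
            parts.add(matcher.group());
        }
        return parts;
    }

    public static void main(String[] args) {
        System.out.println(parse("getPhoneNumber"));  // [get, PhoneNumber]
        // The "-" cannot be matched, and \G will not let the engine skip it,
        // so matching stops after the action.
        System.out.println(parse("get-PhoneNumber")); // [get]
    }
}
```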
When I first started experimenting with the \G boundary match, I tried using this pattern:
^(get|set)|\G\w+$
Here, the \G sits inside only one branch of the alternation, so it is a possible part of the match, not a definite part of the match. While this worked in the Java regular expression engine, it caused the following ColdFusion error in the POSIX engine:
ROOT CAUSE: java.lang.OutOfMemoryError: Java heap space
While the Java engine was able to handle this particular pattern, from what I have read, it seems to be a best practice to always put the \G boundary match at the beginning of the regular expression pattern (as I have done in my main demo).
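As a quick sanity check on the Java side, the sketch below runs both placements of \G against a few sample method names. The class name and inputs are hypothetical. It only shows that, for these particular inputs, java.util.regex returns the same matches either way. It says nothing about other engines or about patterns in general.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PlacementSketch {

    // \G leads the whole pattern, as in the main demo.
    static final Pattern LEADING = Pattern.compile("\\G(?:^(?:get|set)|\\w+$)");

    // \G appears only inside the second branch of the alternation.
    static final Pattern IN_BRANCH = Pattern.compile("^(?:get|set)|\\G\\w+$");

    // Return every successive match of the given pattern.
    static List<String> parts(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        List<String> found = new ArrayList<>();
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        for (String name : new String[] { "getPhoneNumber", "get-PhoneNumber", "PhoneNumber" }) {
            System.out.println(name + " -> " + parts(LEADING, name) + " / " + parts(IN_BRANCH, name));
        }
    }
}
```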
Regular expressions are both awesome and powerful. Every time that I learn a little something new in the pattern matching world, it tends to open up more possibilities for problem solving. I'm glad that I finally took the time to look deeper into the \G boundary matcher.
Reader Comments
Hi Ben,
The "No one has left a comment" display idea is nice, lol
@CFFan,
Thanks - I thought it was fun :)
I must admit I still do not understand the expression quite well though reread the post few times.
Regex expressions are nightmare. Regarding CSV file - it would take few lines of Java or C++ code. I will try to read this post with fresh head. Anyway thank you for post.
PS: You have to be proud of yourself: I am C++/Java programmer with more than 10 years of experience :)
Oleg
http://xkcd.com/1171/
@Paul,
Ha ha, what can I say - I love me some RegEx :D