PatternMatcher.cfc - A ColdFusion Component Wrapper For The Java Regular Expression Engine
ColdFusion is built on top of Java. This architecture not only gives us access to all the ColdFusion functionality, but also to all of the core Java libraries that lie just below the surface. Among those core libraries are the Pattern and Matcher classes contained within the java.util.regex package. These classes provide very powerful regular expression (RegEx) functionality; however, going from a loosely typed language (ColdFusion) to a strongly typed language (Java) can make communication somewhat arduous. As such, I wanted to wrap the Pattern/Matcher access in a ColdFusion component that would encapsulate all of the necessary data type conversions and extraneous mechanics.
To be honest, what I'm going to show you here is nothing very new. I have, many times before, blogged about ways in which to abstract access to the Java Pattern and Matcher classes. User defined functions (UDFs) like reSplit(), reMatchGroup(), reMatchGroups(), and reMultiMatch() all rely on the Java Matcher class internally. Likewise, ColdFusion custom tags like reLoop.cfm, rereplace.cfm, and re:replace.cfm allow for abstractions to both the access and mutation features of the Matcher class. Up until now, however, nothing that I've done has really been ColdFusion-component-based.
At first, I was going to create both a Pattern.cfc and a Matcher.cfc to correspond to the two underlying Java classes; however, what I realized was that in all of my regular expression usage, the Matcher class was really the object of primary use. The Pattern class was simply a stepping stone used to get to a Matcher instance. As such, I decided to narrow the scope of my wrapper down to a single ColdFusion component: PatternMatcher.cfc.
This ColdFusion component takes care of creating both the necessary Pattern class and Matcher class instances; however, it exposes only an augmented subset of the underlying Matcher class's methods:
- init( pattern, input ) :: any
- find() :: boolean
- group( [index [, default ]] ) :: string
- groupCount() :: numeric
- hasGroup( index ) :: boolean
- match() :: array
- matchGroups() :: array
- replaceAll( replacement [, quoteReplacement ] ) :: string
- replaceFirst( replacement [, quoteReplacement ] ) :: string
- replaceWith( replacement [, quoteReplacement ] ) :: any
- reset( [input] ) :: any
- result() :: string
The replaceAll() and replaceFirst() methods are single operation methods - that is, they act on the input value and are done. Likewise, match() and matchGroups() are single operation methods - they collect matches from the input value and return the aggregated collection. find(), group(), and replaceWith(), on the other hand, are where things get a bit more interesting. These last three functions allow you to iterate over each pattern match within the given input string, collecting and/or replacing captured values on a point-by-point basis.
Before we look at the PatternMatcher.cfc code, let's take a look at some examples. The first will demonstrate the looping behavior afforded by the find() method:
<!--- Create an input text in which we will find patterns. --->
<cfsavecontent variable="input">
Sarah 212-555-1234 01/12/1980
Kim 212-555-9399 07/14/1975
Jenna 917-555-8712 05/05/1977
Tricia 646-555-9990 12/20/1974
</cfsavecontent>
<!---
Now, create a pattern to parse the above input. We will be
looking for names, phone numbers, and birthdays. I'm going to
use a VERBOSE regular expression to explain the capture.
--->
<cfsavecontent variable="pattern">(?x)
## Group 1.
## Capture the name.
(\w+)
\s+
## Group 2.
## Capture the phone number.
(\d+ - \d+ - \d+)
\s+
## Group 3.
## Capture the date of birth (DOB).
(\d+ / \d+ / \d+)
</cfsavecontent>
<!---
Now, create a matcher to parse the input string and make
use of it.
--->
<cfset matcher = createObject( "component", "PatternMatcher" ).init(
pattern,
input
) />
<!--- Create an array to keep track of the records. --->
<cfset records = [] />
<!--- Keep looping over the input to find the matches. --->
<cfloop condition="matcher.find()">
<!---
Create a record for this set of matches. When each pattern
is matched, we are given access to each captured group through
the group() method.
NOTE: Group Zero (0) is always the entire pattern match, even
if there is no capturing group around the entire pattern.
--->
<cfset record = {
name = matcher.group( 1 ),
phoneNumber = matcher.group( 2 ),
dateOfBirth = matcher.group( 3 )
} />
<!--- Add the record to the record set. --->
<cfset arrayAppend(
records,
record
) />
</cfloop>
<!--- Output the matches. --->
<cfdump
var="#records#"
label="Parsed Record Data"
/>
Here, we are given a chunk of textual data that contains "person" records. Since each record (line of text) adheres to a pattern, we can use our PatternMatcher.cfc to parse each row into a data structure. Our pattern is composed of three capturing groups - name, phone number, and date-of-birth. As we loop over each match, you'll notice that we have access to each of those captured groups through the group() method.
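When we run the above code, the CFDump output shows an array of four structs, one per line of input, each holding the name, phoneNumber, and dateOfBirth values captured for that record.
As you can see, the match-iteration provided by the PatternMatcher.cfc made our text input easily transformable.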
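For comparison, here is a minimal sketch of the same find()/group() loop written directly against java.util.regex, which is roughly what the component does on our behalf. The class name FindLoopSketch and the parse() helper are hypothetical names used only for illustration; they are not part of PatternMatcher.cfc.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of PatternMatcher.cfc.
class FindLoopSketch {

    // Same three groups as the demo pattern: name, phone number, date of birth.
    // (?x) turns on comments mode, so the spaces inside the pattern are ignored.
    static final Pattern RECORD = Pattern.compile(
        "(?x) (\\w+) \\s+ (\\d+ - \\d+ - \\d+) \\s+ (\\d+ / \\d+ / \\d+)"
    );

    static List<Map<String, String>> parse(String input) {
        List<Map<String, String>> records = new ArrayList<>();
        Matcher matcher = RECORD.matcher(input);
        // find() advances to the next match and returns false when none remain.
        while (matcher.find()) {
            Map<String, String> record = new LinkedHashMap<>();
            record.put("name", matcher.group(1));
            record.put("phoneNumber", matcher.group(2));
            record.put("dateOfBirth", matcher.group(3));
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String input = "Sarah 212-555-1234 01/12/1980\n"
            + "Kim 212-555-9399 07/14/1975\n";
        for (Map<String, String> record : parse(input)) {
            System.out.println(record);
        }
    }
}
```

Note how the Java version has to commit to String-typed maps and int group indexes up front; the javaCast() calls inside the component exist to bridge exactly that gap.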
Gathering data is only half of the magic. Replacing matches is the other. This time, we'll use the same pattern and input; but, rather than aggregating the pattern matches, we'll replace them with altered text.
NOTE: This demo was run directly after the previous one; this is why we start off by reset()'ing the matcher and do not need to re-define our pattern or input values.
<!---
Now, let's imagine that we want to go through the input and XXX
out people's dates of birth. First we'll want to reset our
matcher.
--->
<cfset matcher.reset() />
<!--- Now, let's loop over the matches again. --->
<cfloop condition="matcher.find()">
<!---
When replacing, we can use the captured group back references
for the values that we DO want to keep. Remember, the first
group was the name, the second the phone number, and the
third was the date of birth.
--->
<cfset matcher.replaceWith(
"$1 $2 MM/DD/YYYY"
) />
</cfloop>
<!--- Output the result of the replacement. --->
<cfoutput>
<pre>
#matcher.result()#
</pre>
</cfoutput>
As you can see here, we are using the replaceWith() method to replace the current match with the given value. Part of our value uses back-references ($1, $2), which allow us to use captured groups within the replacement text. As we are performing these replacements, the PatternMatcher.cfc is building up an internal buffer; once we are done replacing values, we can then gain access to that internal buffer by using the result() method.
When we run the above code, we get the following output:
Sarah 212-555-1234 MM/DD/YYYY
Kim 212-555-9399 MM/DD/YYYY
Jenna 917-555-8712 MM/DD/YYYY
Tricia 646-555-9990 MM/DD/YYYY
As you can see, the date-of-birth was successfully "erased."
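The replaceWith() and result() pair maps onto Java's Matcher.appendReplacement() and Matcher.appendTail(). As a rough sketch of that mechanism in plain Java (ReplaceWithSketch and maskDates() are hypothetical names, and the pattern is a non-verbose version of the one used above):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of PatternMatcher.cfc.
class ReplaceWithSketch {

    static String maskDates(String input) {
        Pattern pattern = Pattern.compile(
            "(\\w+)\\s+(\\d+-\\d+-\\d+)\\s+(\\d+/\\d+/\\d+)"
        );
        Matcher matcher = pattern.matcher(input);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            // Copies the text between the previous match and this one, then
            // the replacement, with $1 and $2 expanded to the captured groups.
            matcher.appendReplacement(buffer, "$1 $2 MM/DD/YYYY");
        }
        // Copies whatever input follows the last replaced match.
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print(maskDates(
            "Sarah 212-555-1234 01/12/1980\nKim 212-555-9399 07/14/1975\n"
        ));
    }
}
```

Because appendReplacement() only copies text up to the current match, anything after the final match is missing from the buffer until appendTail() is called; that is why the component defers that step to result().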
The find(), group(), replaceWith(), replaceFirst(), and replaceAll() methods really comprise the core Pattern/Matcher functionality from the Java layer. As an added bonus, I have also added two utility methods - match() and matchGroups() - which provide a more powerful alternative to ColdFusion's native reMatch() function.
<!--- Output all the matches of the given pattern. --->
<cfdump
var="#matcher.match()#"
label="Match()"
/>
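Running the above code, we get an array of four strings, one for each full pattern match (the name, phone number, and date of birth of one record).
This works like the native reMatch() function; only, it gives you access to the Java regular expression library, which is much more robust.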
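To make "more robust" a little more concrete, here is a small Java sketch of the same collect-all-matches loop using a lookbehind, a construct that, according to a reader comment on this post, the native reReplaceNoCase() rejects. MatchAllSketch and its match() helper are hypothetical names for illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of PatternMatcher.cfc.
class MatchAllSketch {

    // Collects every full match (group zero) of the pattern in the input.
    static List<String> match(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        // (?<=\$) is a zero-width lookbehind: only digits preceded by "$" match.
        System.out.println(match("(?<=\\$)\\d+", "apples $5, pears 3, plums $12"));
    }
}
```

With that input, the helper returns the list [5, 12]; the bare "3" is skipped because no dollar sign precedes it.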
The matchGroups() function works like the match() function; only, it breaks the individual matches down by captured group:
<!---
Output all the matches of the given pattern, broken down
by captured group.
--->
<cfdump
var="#matcher.matchGroups()#"
label="MatchGroups()"
/>
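Running the above code, we get an array of four structs, one per match; each struct is keyed by group index, with 0 holding the entire match and 1 through 3 holding the name, phone number, and date of birth.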
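The per-group breakdown can be sketched in plain Java as follows. This is only an approximation of what matchGroups() does: MatchGroupsSketch is a hypothetical name, and a TreeMap keyed by group index stands in for the ColdFusion struct.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of PatternMatcher.cfc.
class MatchGroupsSketch {

    static List<Map<Integer, String>> matchGroups(String regex, String input) {
        List<Map<Integer, String>> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            Map<Integer, String> match = new TreeMap<>();
            // Group 0 is the entire match; groupCount() excludes it,
            // hence the inclusive upper bound.
            for (int i = 0; i <= matcher.groupCount(); i++) {
                String value = matcher.group(i);
                // Groups that did not participate come back as null; skip them.
                if (value != null) {
                    match.put(i, value);
                }
            }
            matches.add(match);
        }
        return matches;
    }

    public static void main(String[] args) {
        System.out.println(matchGroups("(\\w+)=(\\d+)?", "a=1 b="));
    }
}
```

For the input "a=1 b=", the second match has no key 2 at all, because the optional digit group captured nothing.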
Ok, enough exploration. Let's take a look at the PatternMatcher.cfc ColdFusion component:
PatternMatcher.cfc
<cfcomponent
output="false"
hint="I provide easier, implicitly type-cast access to the underlying Java Pattern and Matcher functionality.">
<cffunction
name="init"
access="public"
returntype="any"
output="false"
hint="I return an initialized component.">
<!--- Define arguments. --->
<cfargument
name="pattern"
type="string"
required="true"
hint="I am the Java-compatible regular expression to be used to create this pattern matcher."
/>
<cfargument
name="input"
type="string"
required="true"
hint="I am the input text over which we will be matching the above regular expression pattern."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Store the original values. --->
<cfset variables.pattern = arguments.pattern />
<cfset variables.input = arguments.input />
<!---
Compile the regular expression pattern and get the
matcher for the given input sequence.
--->
<cfset variables.matcher =
createObject( "java", "java.util.regex.Pattern" )
.compile( javaCast( "string", variables.pattern ) )
.matcher( javaCast( "string", variables.input ) )
/>
<!--- Create a buffer to store the replacement result. --->
<cfset variables.buffer = createObject( "java", "java.lang.StringBuffer" ).init() />
<!--- Return this object reference for method chaining. --->
<cfreturn this />
</cffunction>
<cffunction
name="find_"
access="public"
returntype="boolean"
output="false"
hint="I attempt to find the next pattern match located within the input string.">
<!--- Pass this request onto the matcher. --->
<cfreturn variables.matcher.find() />
</cffunction>
<cffunction
name="group"
access="public"
returntype="any"
output="false"
hint="I return the value captured by the given group. NOTE: Zero (0) will return the entire pattern match.">
<!--- Define arguments. --->
<cfargument
name="index"
type="numeric"
required="false"
default="0"
/>
<cfargument
name="default"
type="string"
required="false"
hint="I am the optional default to use if the given group (index) was not captured. Non-captured group references will return VOID. A default can be used to return non-void values."
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Get the given group value. --->
<cfset local.capturedValue = variables.matcher.group(
javaCast( "int", arguments.index )
) />
<!---
Check to see if the given group was able to capture a
value (or if it did not, in which case, it will return
NULL, destroying the variable).
--->
<cfif structKeyExists( local, "capturedValue" )>
<!--- Return the captured value. --->
<cfreturn local.capturedValue />
<cfelseif structKeyExists( arguments, "default" )>
<!---
No group was captured, but a default value was
provided. Return the default value.
--->
<cfreturn arguments.default />
<cfelse>
<!---
No value was captured and no default was provided;
simply return VOID to the calling context.
--->
<cfreturn />
</cfif>
</cffunction>
<cffunction
name="groupCount"
access="public"
returntype="numeric"
output="false"
hint="I return the number of capturing groups within the regular expression pattern.">
<!--- Pass this request onto the matcher. --->
<cfreturn variables.matcher.groupCount() />
</cffunction>
<cffunction
name="hasGroup"
access="public"
returntype="boolean"
output="false"
hint="I determine whether or not the given group was captured in the previous match.">
<!--- Define arguments. --->
<cfargument
name="index"
type="numeric"
required="true"
/>
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Get the given group value. --->
<cfset local.capturedValue = variables.matcher.group(
javaCast( "int", arguments.index )
) />
<!---
Return whether or not the given captured group exists.
If it was captured, the value will exist; if it was not
captured, the given group value will be NULL (and hence
not exist).
--->
<cfreturn structKeyExists( local, "capturedValue" ) />
</cffunction>
<cffunction
name="match"
access="public"
returntype="array"
output="false"
hint="I return the collection of all pattern matches found within the given input. NOTE: This resets the internal matcher.">
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Reset the pattern matcher. --->
<cfset this.reset() />
<!---
Create an array in which to hold the aggregated
pattern matches.
--->
<cfset local.matches = [] />
<!--- Keep looping, looking for matches. --->
<cfloop condition="variables.matcher.find()">
<!--- Gather the current match. --->
<cfset arrayAppend(
local.matches,
variables.matcher.group()
) />
</cfloop>
<!--- Return the collected matches. --->
<cfreturn local.matches />
</cffunction>
<cffunction
name="matchGroups"
access="public"
returntype="array"
output="false"
hint="I return the collection of all pattern matches found within the given input, broken down by group. NOTE: This resets the internal matcher.">
<!--- Define the local scope. --->
<cfset var local = {} />
<!--- Reset the pattern matcher. --->
<cfset this.reset() />
<!---
Create an array in which to hold the aggregated
pattern matches.
--->
<cfset local.matches = [] />
<!--- Keep looping, looking for matches. --->
<cfloop condition="variables.matcher.find()">
<!--- Create a match object. --->
<cfset local.match = {} />
<!---
Move all of the captured groups into the match object
(with zero being the entire match).
--->
<cfloop
index="local.groupIndex"
from="0"
to="#variables.matcher.groupCount()#"
step="1">
<!--- Get the local value. --->
<cfset local.groupValue = variables.matcher.group(
javaCast( "int", local.groupIndex )
) />
<!---
Check to see if the value exists and only set it
if it does; ColdFusion seems to not like having
a NULL set into the struct (although it really
shouldn't have a problem with it).
--->
<cfif structKeyExists( local, "groupvalue" )>
<!--- Store the captured value. --->
<cfset local.match[ local.groupIndex ] = local.groupValue />
</cfif>
</cfloop>
<!--- Add the current match object. --->
<cfset arrayAppend(
local.matches,
local.match
) />
</cfloop>
<!--- Return the collected matches. --->
<cfreturn local.matches />
</cffunction>
<cffunction
name="replaceAll"
access="public"
returntype="string"
output="false"
hint="I replace all the pattern matches of the original input with the given value.">
<!--- Define arguments. --->
<cfargument
name="replacement"
type="string"
required="true"
hint="I am the string with which we are replacing the pattern matches."
/>
<cfargument
name="quoteReplacement"
type="boolean"
required="false"
default="false"
hint="I determine whether or not the replacement value should be quoted (this will escape any back reference values)."
/>
<!--- Check to see if we are quoting the replacement. --->
<cfif arguments.quoteReplacement>
<!--- Quote the replacement string. --->
<cfreturn variables.matcher.replaceAll(
variables.matcher.quoteReplacement(
javaCast( "string", arguments.replacement )
)
) />
<cfelse>
<!--- Use the replacement text as-is. --->
<cfreturn variables.matcher.replaceAll(
javaCast( "string", arguments.replacement )
) />
</cfif>
</cffunction>
<cffunction
name="replaceFirst"
access="public"
returntype="string"
output="false"
hint="I replace the first pattern match of the original input with the given value.">
<!--- Define arguments. --->
<cfargument
name="replacement"
type="string"
required="true"
hint="I am the string with which we are replacing the first pattern match."
/>
<cfargument
name="quoteReplacement"
type="boolean"
required="false"
default="false"
hint="I determine whether or not the replacement value should be quoted (this will escape any back reference values)."
/>
<!--- Check to see if we are quoting the replacement. --->
<cfif arguments.quoteReplacement>
<!--- Quote the replacement string. --->
<cfreturn variables.matcher.replaceFirst(
variables.matcher.quoteReplacement(
javaCast( "string", arguments.replacement )
)
) />
<cfelse>
<!--- Use the replacement text as-is. --->
<cfreturn variables.matcher.replaceFirst(
javaCast( "string", arguments.replacement )
) />
</cfif>
</cffunction>
<cffunction
name="replaceWith"
access="public"
returntype="any"
output="false"
hint="I replace the current match with the given value. NOTE: Back references within the replacement string will be honored unless the replacement value is quoted (see second argument).">
<!--- Define arguments. --->
<cfargument
name="replacement"
type="string"
required="true"
hint="I am the value with which we are replacing the previous match."
/>
<cfargument
name="quoteReplacement"
type="boolean"
required="false"
default="false"
hint="I determine whether or not the replacement value should be quoted (this will escape any back reference values)."
/>
<!--- Check to see if we are quoting the replacement. --->
<cfif arguments.quoteReplacement>
<!--- Quote the replacement value before you use it. --->
<cfset variables.matcher.appendReplacement(
variables.buffer,
variables.matcher.quoteReplacement(
javaCast( "string", arguments.replacement )
)
) />
<cfelse>
<!--- Use raw replacement value. --->
<cfset variables.matcher.appendReplacement(
variables.buffer,
javaCast( "string", arguments.replacement )
) />
</cfif>
<!--- Return this object reference for method chaining. --->
<cfreturn this />
</cffunction>
<cffunction
name="reset"
access="public"
returntype="any"
output="false"
hint="I reset the pattern matcher.">
<!--- Define arguments. --->
<cfargument
name="input"
type="string"
required="false"
hint="I am the optional input with which to reset the pattern matcher."
/>
<!--- Check to see if a new input is being used. --->
<cfif structKeyExists( arguments, "input" )>
<!--- Use a new input to reset the matcher. --->
<cfset variables.matcher.reset(
javaCast( "string", arguments.input )
) />
<!--- Store the input property. --->
<cfset variables.input = arguments.input />
<cfelse>
<!--- Reset the internal matcher. --->
<cfset variables.matcher.reset() />
</cfif>
<!--- Reset the internal results buffer. --->
<cfset variables.buffer = createObject( "java", "java.lang.StringBuffer" ).init() />
<!--- Return this object reference for method chaining. --->
<cfreturn this />
</cffunction>
<cffunction
name="result"
access="public"
returntype="string"
output="false"
hint="I return the result of the replacement up until this point.">
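<!--- Define the local scope. --->
<cfset var local = {} />
<!---
Since we are no longer dealing with replacements, append
the rest of the unmatched input string. We append to a copy
of the results buffer because appendTail() does not advance
the matcher's append position; appending to the shared
buffer would duplicate the tail if result() were called
more than once.
--->
<cfset local.resultBuffer = createObject( "java", "java.lang.StringBuffer" ).init(
javaCast( "string", variables.buffer.toString() )
) />
<cfset variables.matcher.appendTail(
local.resultBuffer
) />
<!--- Return the resultant string. --->
<cfreturn local.resultBuffer.toString() />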
</cffunction>
<!--- ------------------------------------------------- --->
<!--- ------------------------------------------------- --->
<!---
Swap some of the method names; we couldn't name it "find"
to begin with otherwise we'd get a ColdFusion error for
conflicting with a native function name.
--->
<cfset this.find = this.find_ />
</cfcomponent>
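Two details in the component are easier to see at the Java level: group() returns null for a capturing group that did not participate in the match (which is why the component checks structKeyExists() before returning), and Matcher.quoteReplacement() escapes "$" and "\" so that a replacement string is taken literally. A minimal sketch, assuming the hypothetical helper names groupOrDefault() and replaceAllLiteral():

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of PatternMatcher.cfc.
class GroupAndQuoteSketch {

    // Returns the captured text for the group, or the fallback when the group
    // did not participate in the match (Matcher.group() returns null then).
    static String groupOrDefault(Matcher matcher, int index, String fallback) {
        String value = matcher.group(index);
        return (value != null) ? value : fallback;
    }

    // Replaces every match with the literal replacement text; quoteReplacement()
    // keeps "$1" from being read as a reference to captured group 1.
    static String replaceAllLiteral(String regex, String input, String replacement) {
        return Pattern.compile(regex)
            .matcher(input)
            .replaceAll(Matcher.quoteReplacement(replacement));
    }

    public static void main(String[] args) {
        Matcher matcher = Pattern.compile("(\\d+)(px)?").matcher("width: 100");
        if (matcher.find()) {
            // Group 2 is optional and absent here, so the fallback is used.
            System.out.println(groupOrDefault(matcher, 2, "(no unit)"));
        }
        System.out.println(replaceAllLiteral("\\d+", "cost: 5", "$1.00"));
    }
}
```

Without the quoting step, the second call would fail, since the pattern \d+ has no group 1 for "$1" to refer to.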
Java's Pattern and Matcher classes are, without a doubt, amazing. They provide for a very efficient, very robust regular expression engine - much more powerful than the one at the native ColdFusion level. I'm always looking for ways to make using these classes easier. Creating a ColdFusion component wrapper might just be the easiest approach yet.
Reader Comments
No words to say, only "fantastic"...
Great script Ben, thanks!
@Christiano,
Thank you my good man :)
Going in my utility package now. Thanks.
@John,
Awww yeaah!!
Without doubt the coolest concept this year so far, and I'd not be surprised to be saying the same come Xmas. Genius! Thanks once again for the inspiration.
@Ian,
Thanks my man - I really appreciate that kind of feedback.
Now I can use Zero-width Negative Lookbehind expressions that REReplaceNoCase threw up on.
Thanks, Ben
@Wade,
Exactly! This just picks up where reReplace() left off :D
@All,
I rewrote this component using CFScript and moved it over to GitHub:
www.bennadel.com/blog/2524-PatternMatcher-cfc-Now-On-GitHub.htm
Some of the methods have changed, but it's basically the same; and it now has unit tests!