RELoop ColdFusion Custom Tag To Iterate Over Regular Expression Patterns
Yesterday, when I was working on my GarbleText() ColdFusion user-defined function, I begrudged the fact that I was, for the hundredth time, writing a Java Pattern Matcher and then looping over it and messing with String Buffers. This stuff is just so routine and boilerplate (if you work with Java regular expressions)! Then I had this awesome idea - take that functionality and make it into a ColdFusion custom tag.
And so, I came up with RELoop.cfm. This ColdFusion custom tag takes a regular expression pattern and loops over its matches in a target text. It can return each pattern match as a single string or as a structure; it can even update the text and return it back into a variable. Here are the possible attributes:
Index
This is the CALLER-scoped variable into which we will store the contextual match. This may be a string or a struct depending on whether the user wants the Groups returned.
Text
This is the text in which we will be searching for and iterating over the pattern matches.
Pattern
This is the regular expression pattern that we will be matching. Since we are using the Java regular expression engine, this goes by the java.util.regex.Pattern syntax, not necessarily by the ColdFusion regex syntax (I am not sure what the ColdFusion to Java compatibility issues are. I do know that Java supports more powerful regular expressions).
Variable
This is the CALLER-scoped variable into which the resulting string (after the patterns have been replaced) will be placed.
ReturnGroups
This flags whether to return the single matched pattern or to return a structure with the set of groups returned. The structure stores the entire match in the "0" key. It then stores each captured group in the one-based index key. Additionally, it returns the number of captured groups in the GroupCount key and the one-based offset of the match in the Start key.
The only required attributes are Index, Text, and Pattern. The optional attributes, Variable and ReturnGroups, enhance the functionality of the RELoop.cfm ColdFusion custom tag.
Ok, now that we see the attributes, let's take a look at some examples. First, we are going to start out building the target text and the regular expression we want to use. The regular expression will be used to loosely find email addresses (don't concentrate on the regular expression - it's just a detail):
<!---
Save some text that we want to explore using
regular expressions.
--->
<cfsavecontent variable="strText">
Hey campers, Mistress Kim here. I'll be out of town for
a few days at my annual hedonist retreat. Email my friends
sandi@mistress-kims.com or michelle@mistress-kims.com if you
really need to reach out and touch someone and you simply
can't wait for me to come back...
</cfsavecontent>
<!---
Set up the Java regular expression for matching email
addresses. We are going to capture each part of the email
in a different group.
--->
<cfset strRegEx = (
"(?i)" &
"([\w\.\-_+]+)" &
"@" &
"([\w\.\-_]+)" &
"\." &
"(\w{2,9})"
)/>
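For anyone more used to reading Java than CFML, here is a rough sketch of the same pattern compiled directly with java.util.regex. The class and method names (EmailPattern, firstParts) are made up for illustration. The two things worth noticing are that Java source needs every backslash doubled, and that the inline (?i) flag is equivalent to compiling with Pattern.CASE_INSENSITIVE.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the name is an assumption for this sketch.
final class EmailPattern {

    // Same loose email pattern as above, with backslashes doubled for Java.
    static final Pattern EMAIL = Pattern.compile(
        "(?i)"
        + "([\\w\\.\\-_+]+)"   // group 1: user name
        + "@"
        + "([\\w\\.\\-_]+)"    // group 2: domain name
        + "\\."
        + "(\\w{2,9})"         // group 3: domain extension
    );

    // Returns { user, domain, extension } for the first address, or null.
    static String[] firstParts(String text) {
        Matcher matcher = EMAIL.matcher(text);
        if (!matcher.find()) {
            return null;
        }
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = firstParts("Email SANDI@mistress-kims.com today");
        System.out.println(String.join(" | ", parts));
    }
}
```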
Now that we have that established, let's start out with a basic example. In this snippet, we are simply going to loop over the text and output the index value which contains the matched email addresses:
<!---
Loop over the text using the email address pattern. For
each iteration, we are just gonna use the index to output
the individual email addresses.
--->
<cfmodule
template="reloop.cfm"
index="strMatch"
pattern="#strRegEx#"
text="#strText#">
#strMatch#<br />
</cfmodule>
Running this code, we get the following output:
sandi@mistress-kims.com
michelle@mistress-kims.com
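To show what the tag is hiding in this simple case, here is a minimal Java sketch of the same find loop. FindLoop and findAll are hypothetical names, and the list is only there so the result can be inspected; the tag itself hands each match to the tag body instead of collecting them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class name; a sketch of the loop, not the tag's actual code.
final class FindLoop {

    // Collects every full match of regex in text, in order of appearance.
    static List<String> findAll(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        List<String> matches = new ArrayList<>();
        // find() advances to the next match and returns false when done.
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String regex = "(?i)([\\w\\.\\-_+]+)@([\\w\\.\\-_]+)\\.(\\w{2,9})";
        String text = "Email my friends sandi@mistress-kims.com or "
            + "michelle@mistress-kims.com if you really need to reach out";
        for (String match : findAll(regex, text)) {
            System.out.println(match);
        }
    }
}
```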
Above, we got the pattern match as a string, which is the default behavior of the RELoop.cfm ColdFusion custom tag. By setting the ReturnGroups attribute to true, we have the tag return the match as a group structure which separates the entire pattern, the individual matched groups, and the group count. This time, we are going to request the group structure and then just CFDump it out so you can see what's going on:
<!---
Loop over the text using the email address pattern. For
each iteration, we are gonna use the ReturnGroups boolean
to get a structure of the captured groups.
--->
<cfmodule
template="reloop.cfm"
index="objMatch"
pattern="#strRegEx#"
text="#strText#"
returngroups="true">
<!--- Dump out the match structure. --->
<cfdump
var="#objMatch#"
label="Pattern Match"
/>
</cfmodule>
Running the above code, we get two CFDump outputs, one for each matched email address. Each pattern match is returned in an easy-to-use structure.
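The shape of that structure can be sketched in Java as a map keyed the same way the tag keys its struct: "0" for the full match, "1" through "N" for the captured groups, plus GroupCount and a one-based Start. The class name GroupStruct and the simplified email pattern in main are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; mirrors the keys the tag stores for each match.
final class GroupStruct {

    // Must be called after a successful find() on the matcher.
    static Map<String, Object> toStruct(Matcher matcher) {
        Map<String, Object> struct = new LinkedHashMap<>();
        struct.put("GroupCount", matcher.groupCount());
        // Java offsets are zero-based; add one to match ColdFusion.
        struct.put("Start", matcher.start() + 1);
        struct.put("0", matcher.group());
        for (int i = 1; i <= matcher.groupCount(); i++) {
            struct.put(String.valueOf(i), matcher.group(i));
        }
        return struct;
    }

    public static void main(String[] args) {
        // Simplified pattern, used here only to keep the example short.
        Matcher matcher = Pattern
            .compile("(\\w+)@([\\w-]+)\\.(\\w{2,9})")
            .matcher("Write to sandi@mistress-kims.com");
        if (matcher.find()) {
            System.out.println(toStruct(matcher));
        }
    }
}
```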
But this ColdFusion custom tag can do more than just extract and return regular expression pattern matches. It can also update the text matched by the pattern. For the following two examples, we are going to find the email addresses and replace the COM domain extension with the ORG domain extension. In order to get the tag to operate in this manner, you have to define the Variable tag attribute. This attribute is the name of the variable into which you want to store the result.
This replacement works for both the simple string matches as well as the group matches. For the simple string match, all you have to do is update the Index value:
<!---
Loop over the text using the email address pattern. For
each iteration, we are gonna use the Variable attribute to
flag that we actually want to update the target string.
Because of this flag, updates to the INDEX variable will
be integrated back into the new string.
--->
<cfmodule
template="reloop.cfm"
index="strMatch"
pattern="#strRegEx#"
text="#strText#"
variable="strUpdatedText">
<!---
For each email address, we want to change the COM domain
extension to be the ORG extension. To do that, we have
to update the INDEX value.
--->
<cfset strMatch = REReplace(
strMatch,
"\.\w+$",
".org",
"ONE"
) />
</cfmodule>
<!---
Now that we have finished looping, our Variable,
strUpdatedText, should now have the updated text
value.
--->
#strUpdatedText#
Notice that we are just using REReplace() to swap out COM with ORG and that we are storing the resultant string into the variable strUpdatedText. Running the above code, we get the following output:
Hey campers, Mistress Kim here. I'll be out of town for a few days at my annual hedonist retreat. Email my friends sandi@mistress-kims.org or michelle@mistress-kims.org if you really need to reach out and touch someone and you simply can't wait for me to come back...
Notice that the email addresses were successfully swapped and all we had to do was update the match value. The same works for the structured group match; the only difference is that instead of updating the Index itself, we are updating the ZERO group of the structure (which represents the entire matched value):
<!---
Loop over the text using the email address pattern. For
each iteration, we are gonna use the ReturnGroups boolean
to get a structure of the captured groups. Then, we are
going to update the ZERO group (full match) in conjunction
with the Variable attribute to flag that we actually want
to update the target string.
--->
<cfmodule
template="reloop.cfm"
index="objMatch"
pattern="#strRegEx#"
text="#strText#"
returngroups="true"
variable="strUpdatedText">
<!---
For each email address, we want to change the COM domain
extension to be the ORG extension. To do that, we have
to update the ZERO group within the returned structure.
--->
<cfset objMatch[ 0 ] = (
objMatch[ 1 ] & "@" &
objMatch[ 2 ] & "." &
"org"
) />
</cfmodule>
<!---
Now that we have finished looping, our Variable,
strUpdatedText, should now have the updated text
value.
--->
#strUpdatedText#
As you can see, we are updating the replacement text via the key, objMatch[ "0" ]. Running the above code, we get the following output:
Hey campers, Mistress Kim here. I'll be out of town for a few days at my annual hedonist retreat. Email my friends sandi@mistress-kims.org or michelle@mistress-kims.org if you really need to reach out and touch someone and you simply can't wait for me to come back...
Works just as expected.
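For comparison, this is roughly the Java boilerplate that the Variable attribute stands in for: a StringBuffer, appendReplacement() for each match, and appendTail() for whatever follows the last match. ReplaceLoop and toOrg are hypothetical names, and the sketch uses Matcher.quoteReplacement() to keep the new text literal, which is a substitution for the tag's own escaping rather than a copy of it.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; a sketch of the replace loop, not the tag's code.
final class ReplaceLoop {

    // Rewrites the trailing extension of every match of regex to ".org".
    static String toOrg(String regex, String text) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            // Equivalent of updating the Index value inside the tag body.
            String updated = matcher.group().replaceFirst("\\.\\w+$", ".org");
            // Copies the text before the match, then the updated match.
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(updated));
        }
        // Copies whatever follows the last match.
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        String regex = "(?i)([\\w\\.\\-_+]+)@([\\w\\.\\-_]+)\\.(\\w{2,9})";
        System.out.println(toOrg(regex,
            "Email sandi@mistress-kims.com or michelle@mistress-kims.com now."));
    }
}
```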
I really like this tag because not only does it encapsulate the boilerplate Java Pattern Matcher functionality, but it also creates a really easy interface for updating text with complex logic. Sure, if you just wanted to replace COM with ORG, a simple REReplace() would have worked; but what if you needed to do some conditional updates? Or gather the matches? This tag merges the functionality of things like the new ColdFusion 8 function REMatch() and the ColdFusion custom tag REExtract.cfm, giving you the flexibility to perform both actions.
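To make the "conditional updates" idea concrete, here is a small Java sketch in which the tag body becomes a function called once per match, with its return value spliced back into the text. The helper reLoop and the class ReLoopSketch are hypothetical names, and the rule used (only sandi's address moves to ORG) is invented purely for illustration.

```java
import java.util.function.Function;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the function argument plays the role of the tag body.
final class ReLoopSketch {

    // Calls body once per match and splices its result into the text.
    static String reLoop(String regex, String text, Function<MatchResult, String> body) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            String replacement = body.apply(matcher.toMatchResult());
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        String text = "Email sandi@mistress-kims.com or michelle@mistress-kims.com";
        // Made-up rule: change the extension for one user, keep the rest.
        String result = reLoop("(\\w+)@([\\w-]+)\\.(\\w+)", text, match ->
            match.group(1).equalsIgnoreCase("sandi")
                ? match.group(1) + "@" + match.group(2) + ".org"
                : match.group()
        );
        System.out.println(result);
    }
}
```

A plain REReplace() cannot express this kind of per-match decision without a second pass, which is the gap the tag is meant to fill.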
Anyway, here is the code behind the tag:
<!--- Kill extra output. --->
<cfsilent>
<!---
Check to see which execution mode the tag is running
in. In the start mode, we will param the tag attributes
and start the loop. In the end mode, we will finish
executing the loop.
--->
<cfswitch expression="#THISTAG.ExecutionMode#">
<cfcase value="Start">
<!--- Param the tag attributes. --->
<!---
This is the CALLER-scoped variable into which we
will store the contextual match. This may be a
string or a struct depending on whether the user
wants the Groups returned.
--->
<cfparam
name="ATTRIBUTES.Index"
type="variablename"
/>
<!---
This is the text in which we will be searching
for and iterating over the pattern matches.
--->
<cfparam
name="ATTRIBUTES.Text"
type="string"
/>
<!---
This is the regular expression pattern that
we will be matching. Since we are using the Java
regular expression engine, this goes by the
java.util.regex.Pattern syntax, not
necessarily by the ColdFusion regex syntax.
--->
<cfparam
name="ATTRIBUTES.Pattern"
type="string"
/>
<!---
This is the CALLER-scoped variable into which
the resulting string (after the patterns have
been replaced) will be placed. If this attribute
is left at its default ("undefined"), then no
value will be passed back to the caller.
--->
<cfparam
name="ATTRIBUTES.Variable"
type="variablename"
default="undefined"
/>
<!---
This flags whether to return the single matched
pattern or to return a structure with the set of
groups returned.
--->
<cfparam
name="ATTRIBUTES.ReturnGroups"
type="boolean"
default="false"
/>
<!---
ASSERT: At this point, all of the tag attributes
have been properly validated.
--->
<!---
Set a short-hand flag to test for variable
return (based on the current name).
--->
<cfset THISTAG.HasReturn = (ATTRIBUTES.Variable NEQ "undefined") />
<!--- Create the compiled pattern object. --->
<cfset THISTAG.Pattern = CreateObject(
"java",
"java.util.regex.Pattern"
).Compile(
JavaCast(
"string",
ATTRIBUTES.Pattern
)
)
/>
<!---
Get the pattern matcher for our target text from
the compiled pattern.
--->
<cfset THISTAG.Matcher = THISTAG.Pattern.Matcher(
JavaCast(
"string",
ATTRIBUTES.Text
)
) />
<!---
Get the string buffer into which we will store
the replaced text. However, we only care about
this if the user wants the value back.
--->
<cfif THISTAG.HasReturn>
<!--- Create string buffer. --->
<cfset THISTAG.Buffer = CreateObject(
"java",
"java.lang.StringBuffer"
).Init()
/>
</cfif>
<!---
Find the first pattern match. Since the
Find() function returns a boolean flagging
that a match was found, check to see if it
returns true.
--->
<cfif THISTAG.Matcher.Find()>
<!---
Now that we found the first match, we need
to check to see how the user wants the
returned match - as a single string or as a
group structure.
--->
<cfif ATTRIBUTES.ReturnGroups>
<!---
The user wants the group structure to be
returned so create a structure to hold
this data.
--->
<cfset THISTAG.Index = StructNew() />
<!---
Store the number of matching groups in
the regular expression.
--->
<cfset THISTAG.Index.GroupCount = THISTAG.Matcher.GroupCount() />
<!---
Store the index of the match. Add one
to get it to be a one-based index like
ColdFusion.
--->
<cfset THISTAG.Index.Start = (THISTAG.Matcher.Start() + 1) />
<!--- Store the entire match. --->
<cfset THISTAG.Index[ "0" ] = THISTAG.Matcher.Group() />
<!---
Loop over the matching groups and store
them via indexes into the Index object.
--->
<cfloop
index="THISTAG.GroupIndex"
from="1"
to="#THISTAG.Index.GroupCount#"
step="1">
<cfset THISTAG.Index[ "#THISTAG.GroupIndex#" ] = THISTAG.Matcher.Group(
JavaCast(
"int",
THISTAG.GroupIndex
)
) />
</cfloop>
<!---
Store new index object into the CALLER-
scoped index variable.
--->
<cfset "CALLER.#ATTRIBUTES.Index#" = THISTAG.Index />
<cfelse>
<!---
The user just wants a string. Store
the entire matching substring into the
CALLER-scoped index variable.
--->
<cfset "CALLER.#ATTRIBUTES.Index#" = THISTAG.Matcher.Group() />
</cfif>
<cfelse>
<!---
No matching pattern was found. We need to
fully exit the tag without doing any looping.
However, we may need to store the value back
into the caller. Check to see if we have a
return value.
--->
<cfif THISTAG.HasReturn>
<!---
Just store the submitted text back into
the CALLER scope.
--->
<cfset "CALLER.#ATTRIBUTES.Variable#" = ATTRIBUTES.Text />
</cfif>
<!--- Exit out of tag execution. --->
<cfexit method="exittag" />
</cfif>
</cfcase>
<cfcase value="End">
<!---
Now that the user has had a chance to update
the matching group values, let's see if they
are looking for a return. If they are, then
we need to replace the groups back into the
string buffer.
--->
<cfif THISTAG.HasReturn>
<!---
Now, we have to see how the user was
viewing the groups (as a single string
or a group structure).
--->
<cfif ATTRIBUTES.ReturnGroups>
<!---
When we are dealing with the group
structure, we only care about the entire
matched string (the zero group). Store
that back into the Index so that we can
deal with the replacement uniformly.
--->
<cfset THISTAG.Index = CALLER[ ATTRIBUTES.Index ][ "0" ] />
<cfelse>
<!---
Store the replacement string into the
local Index so that we can deal with
the replacement uniformly.
--->
<cfset THISTAG.Index = CALLER[ ATTRIBUTES.Index ] />
</cfif>
<!---
Replace the string into the buffer. When
appending the replacement, be sure to escape
any special characters.
--->
<cfset THISTAG.Matcher.AppendReplacement(
THISTAG.Buffer,
JavaCast(
"string",
THISTAG.Index.ReplaceAll(
JavaCast( "string", "([\\\$])" ),
JavaCast( "string", "\\$1" )
)
)
) />
</cfif>
<!---
ASSERT: At this point, we have dealt with all
the changes that the user made to the matched
patterns. Now, we can deal with the next
match iteration.
--->
<!---
Find the next pattern match. Since the
Find() function returns a boolean flagging
that a match was found, check to see if it
returns true.
--->
<cfif THISTAG.Matcher.Find()>
<!---
Now that we found the next match, we need
to check to see how the user wants the
returned match - as a single string or as a
group structure.
--->
<cfif ATTRIBUTES.ReturnGroups>
<!---
The user wants the group structure to be
returned so create a structure to hold
this data.
--->
<cfset THISTAG.Index = StructNew() />
<!---
Store the number of matching groups in
the regular expression.
--->
<cfset THISTAG.Index.GroupCount = THISTAG.Matcher.GroupCount() />
<!---
Store the index of the match. Add one
to get it to be a one-based index like
ColdFusion.
--->
<cfset THISTAG.Index.Start = (THISTAG.Matcher.Start() + 1) />
<!--- Store the entire match. --->
<cfset THISTAG.Index[ "0" ] = THISTAG.Matcher.Group() />
<!---
Loop over the matching groups and store
them via indexes into the Index object.
--->
<cfloop
index="THISTAG.GroupIndex"
from="1"
to="#THISTAG.Index.GroupCount#"
step="1">
<cfset THISTAG.Index[ "#THISTAG.GroupIndex#" ] = THISTAG.Matcher.Group(
JavaCast(
"int",
THISTAG.GroupIndex
)
) />
</cfloop>
<!---
Store new index object into the CALLER-
scoped index variable.
--->
<cfset "CALLER.#ATTRIBUTES.Index#" = THISTAG.Index />
<cfelse>
<!---
The user just wants a string. Store
the entire matching substring into the
CALLER-scoped index variable.
--->
<cfset "CALLER.#ATTRIBUTES.Index#" = THISTAG.Matcher.Group() />
</cfif>
<!--- Exit the tag as a loop. --->
<cfexit method="loop" />
<cfelse>
<!---
No matching pattern was found. Therefore,
we have exhausted all of the matches in
this string. We need to fully exit the tag;
however, we may need to store the value back
into the caller. Check to see if we have a
return value.
--->
<cfif THISTAG.HasReturn>
<!---
Append the rest of the target text to
the string buffer.
--->
<cfset THISTAG.Matcher.AppendTail(
THISTAG.Buffer
) />
<!---
Convert the string buffer into a
single string and store it back into
the CALLER-scoped variable.
--->
<cfset "CALLER.#ATTRIBUTES.Variable#" = THISTAG.Buffer.ToString() />
</cfif>
<!--- Exit out of tag execution. --->
<cfexit method="exittag" />
</cfif>
</cfcase>
</cfswitch>
</cfsilent>
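One note on the escaping step in the End mode: AppendReplacement() reads "$" followed by a digit as a group reference and "\" as an escape character, so any replacement text supplied by the tag body has to be escaped before it is appended. In plain Java, Matcher.quoteReplacement() does that escaping. The sketch below uses hypothetical names (EscapeReplacement, replaceLiteral) and only shows the idea; it is not the tag's code.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical class; demonstrates literal replacement text.
final class EscapeReplacement {

    // Replaces every match of regex with literalText, taken verbatim.
    static String replaceLiteral(String regex, String text, String literalText) {
        Matcher matcher = Pattern.compile(regex).matcher(text);
        StringBuffer buffer = new StringBuffer();
        while (matcher.find()) {
            // quoteReplacement() puts a backslash before each "\" and "$"
            // so that appendReplacement() treats them as plain characters.
            matcher.appendReplacement(buffer, Matcher.quoteReplacement(literalText));
        }
        matcher.appendTail(buffer);
        return buffer.toString();
    }

    public static void main(String[] args) {
        // An unquoted "$5" would be parsed as a reference to group 5.
        System.out.println(replaceLiteral("PRICE", "Cost: PRICE", "$5 each"));
    }
}
```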
Let me know if you can think of any ways to improve this functionality.
Reader Comments
Just curious as to why you implemented this as a custom tag instead of a CFC?
Also, I haven't looked at it, but does the new ReMatch() function in CF8 do something similar?
@Brian,
Re: custom tag vs. cfc
I think the custom tag syntax is much more conducive to looping. If this sort of functionality was wrapped in a CFC, it would need to have at the very least a conditional CFLoop wrapped around the CFC calls. At that point, I think it makes more sense to just build that right into the custom tag functionality.
Re: REMatch() in ColdFusion 8
The REMatch() function returns an array of matches. This is awesome and is a HUGE benefit to the language, but my custom tag allows you to not only get the matches, but also to update the matched patterns (in the context of the original string). My tag also returns the offset of each pattern match and gives you the option to return the groups broken up into a structure.
I am not saying one is better than the other - they are different beasts meant as solutions to different problems.
Uh oh, looks like Ben is moving to the dark side. First he starts on how to bypass spam filters. Now he's scraping email addresses. Next is a spider, followed up by a full on spamminator.
I sense a disturbance in the Force.
:0
hmm...I think I would have just passed the text string along with the arguments into the CFC method, and had the method return an array containing the structures or matched strings, or the modified text string. In other words, I'd do all the looping within the method call. What you've done is very cool, but it seems like the first thing people (or at least I) would want to do is take the custom tag results and put them into some sort of container (which is what a method call would give you back).
Just to give you an idea of what I'm thinking (and I know one answer is for me to just create this myself, but we're just kicking around ideas here):
REUtils.getStringMatches(text, regex)
returns an array of strings
REUtils.getPatternMatches(text, regex)
returns an array of pattern match structures
REUtils.getMatches(text, regex)
returns an array of structs, where one key is the string and the other is the pattern match structure
REUtils.updateByString(text, regex, replacementText)
returns a text string with matches replaced by the replacementText
REUtils.updateByRegex(text, regex, replacementRegex)
returns a text string with matches replaced using the replacementRegex
@Brian,
I don't think I understand what the functions are doing exactly. Let's say that I had to replace all static URLs with A tags. Some of the links had http:// and some did not. If I wanted to conditionally add http:// to the A tag, I would put that logic inside my CFModule loop.
To do the same thing, I think you are saying that you would:
1. Get an array of matches.
2. Iterate over the array to update the text (in the array).
3. Pass back that array into another function that returns the updated string.
Am I following?
If I'm following you correctly, couldn't you just call this twice?
REUtils.updateByRegex(text, regexForHTTPInLink, replacementRegex)
REUtils.updateByRegex(text, regexForHTTPNotInLink, replacementRegex)
And again, if I follow, if the concern is handling any number of conditional replacements, a method could be created that gives you back the array of pattern match structs, which you could then loop over and modify however you need to, then pass it back to the CFC and it would loop over that array and make each modification to the string, and then return the updated string.
Personally this sounds like an edge case to me (but you're using it and I'm not, so obviously I could be wrong). It seems that just the ability to take a string and give you back an updated string based on a single regex match pattern and replacement value would be far and away the most common use for this.
Actually, on second thought, I'd say the ability to get back a nice array of all of the actual matched values is probably the real winner, since doing that in CF is normally quite a pain. Though I haven't tried out the ReMatch() function yet, I assume that makes things easier.
@Brian,
Ha ha... I am laughing at US, because for what you just listed above, we basically just redefined REReplace() - two calls to it to be accurate :)
But you are right - I think for most cases, simple method calls such as yours would be the most efficient solution.
I think what I was thinking of was my GarbledText() algorithm in which, for each matched pattern, I then had to scramble the letters of the match and replace those back into target text. I guess this is the "edge case" that I feel the Custom Tag approach might just be more natural and efficient as people are both used to the tag-based looping protocol and since it combines the matching and the replace into a single loop rather than multiple loops.
Ok, so maybe my tag is not that useful - oh well. I think it is a good idea, but upon reflection, only for a few edge cases.
Thanks for forcing me to step back and re-evaluate.
@Brian,
Yeah getting back an array is really the big winner :) That is exactly what REMatch() does. The only caveat is that REMatch() still uses the old RegEx engine. The only benefit of rolling your own solution is that you can leverage the Java regular expression engine which is faster and more powerful (in my experience).
To be clear, I'm not diminishing what you created, it's very cool. I'm just spouting ideas off the top of my head! :-)
Yeah, I thought it was cool too... till this Brian guy came along and walked all over it ;)
Just kidding, but seriously, sometimes I get so deep into my ideas that I don't even step back and take stock of how useful they are. I think it is very valuable that you can come along and question alternate and potentially better ways to do this. So, thanks for that.
Heh, you're right. Two straight ReReplace() calls would do it for that example. And I just tried out ReMatch() and it does give back an array of the matching strings which is very nice. So the other thing this would let you do that you can't do now is use another Regex as the actual replacement value instead of just a string.
And yes one more and then I'll shut up. Allowing you to feed in a regex as the replacement value would let you define a regex with alternatives. So as long as there weren't too many variations, you could still do it all in one method call. Anyway, cool stuff!
@Brian,
Thanks for taking the time to knock around some ideas.
By the way, who's Mistress Kim? ;-)
Ha ha, I just made it up. Actually, I didn't even check to see if the domain existed (I am at work and if it DID exist, I probably wouldn't want to pull it up at my desk).
Returning an array of matches is all well and good, but I agree with Ben that his custom tag is awesome, despite the availability of the craptastic REMatch function in CF8.
1. Easy access to Java's more powerful regular expression library -- awesome.
2. Easy use of Java's pattern matcher object iteration functionality -- awesome.
3. Something similar to many languages' ability to use a function to generate replacement strings (e.g., using a closure function with the replace() method in JavaScript) -- freaking awesome.
That's a lot of awesome.
This stuff is not corner case, or at least it shouldn't be.
@Steve,
I think your point, #3, about the anonymous function that can be used in the Javascript replace() function was more or less what I was going for. I find that so powerful in Javascript and I wanted to find a way to, for each match, update the string and return it back into the target string as I would/could in Javascript.
Glad you find this tag interesting.