When To Use \N And $N As Regular Expression Back-References
The other day, I was talking to Ryan Jeffords about regular expressions (RegEx) and there was some confusion about when to use \N versus when to use $N as a captured-group back-reference. It can only be one or the other, so figuring it out is generally not a big issue. But, this does happen to be one of those things that is a bit different in each technology. As such, I thought I would write up a quick comparison of the regular expression back-references for the three main languages that I use: ColdFusion, Java, and Javascript.
In a regular expression, most anything wrapped in parentheses is known as a captured group. There are some exceptions to this, and you can use a syntax that performs non-captured grouping; but, for the most part, groups are captured and numbered from left to right by their opening parenthesis. So, for example, in the following regular expression pattern:
(he(ll)o) (world)
... You would get the following captured groups:
- Group 1: (he(ll)o)
- Group 2: (ll)
- Group 3: (world)
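To see the numbering in action, here is a minimal Javascript sketch that runs the pattern against "hello world" and prints each captured group. The listGroups() helper is hypothetical, made up for this illustration only.

```javascript
// Hypothetical helper (not from the original post): return the text captured
// by each group when the pattern is matched against the input.
function listGroups(pattern, input) {
  const match = pattern.exec(input);
  if (match === null) {
    return [];
  }
  // Index 0 holds the whole match; captured groups start at index 1.
  return match.slice(1);
}

const groups = listGroups(/(he(ll)o) (world)/, "hello world");

groups.forEach((text, i) => {
  console.log("Group " + (i + 1) + ": " + text);
});
// Group 1: hello
// Group 2: ll
// Group 3: world
```

Group 2 sits inside Group 1, but it still gets its own number because its opening parenthesis comes second.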
These groups can be referenced using back-references - \N and $N - both when matching and replacing a given pattern. In either case, the "N" is the numerical digit, 1-9, that represents the index of the captured group. Which notation you use - \N or $N - depends both on the technology and the execution phase (matching vs. replacing) and is what I will be exploring below.
ColdFusion Regular Expressions
In ColdFusion, we can use the reFind() and reReplace() functions to find and replace regular expressions respectively. In the following script, I am going to use both functions to test the various back-reference approaches in the two phases of pattern execution:
<!--- Check to see if \N works within pattern. --->
<cfif reFind( "(ha) \1", "ha ha" )>
Find using \N
<cfelse>
No find using \N
</cfif>
<br />
<br />
<!--- Check to see if $N works within pattern. --->
<cfif reFind( "(ha) $1", "ha ha" )>
Find using $N
<cfelse>
No find using $N
</cfif>
<br />
<br />
<!--- Check to see if \N or $N works in replace. --->
<cfoutput>
#reReplace(
"ha ha",
"(ha) (ha)",
"\1-$2"
)#
</cfoutput>
As you can see here, we are using the string, "ha ha" in all cases. This is a nice string because it is composed of a repeated pattern, "ha." When we run the above code, we get the following output:
Find using \N
No find using $N
ha-$2
To break down what is happening, here's the type of notation that you can use in the two phases of ColdFusion regular expression pattern execution:
Matching: \N
Replacing: \N
Java Regular Expressions
ColdFusion is built on top of Java, but Java uses a different regular expression engine. Therefore, the pattern rules that apply to reFind() and reReplace() (POSIX) are not necessarily the same as the pattern rules that apply to instances of the Java class, java.util.regex.Pattern. In the following test, I am going to use the "undocumented" fact that ColdFusion strings are really Java strings and therefore provide access to the Java String's regular-expression-based methods:
<!--- Set string value. --->
<cfset value = "ha ha" />
<!--- Check to see if \N works within pattern. --->
<cfif value.matches( "(ha) \1" )>
Find using \N
<cfelse>
No find using \N
</cfif>
<br />
<br />
<!--- Check to see if $N works within pattern. --->
<cfif value.matches( "(ha) $1" )>
Find using $N
<cfelse>
No find using $N
</cfif>
<br />
<br />
<!--- Check to see if \N or $N works in replace. --->
<cfoutput>
#value.replaceFirst(
"(ha) (ha)",
"\1-$2"
)#
</cfoutput>
Again, we are using the string, "ha ha." But, this time, we are accessing the matches() and replaceFirst() methods directly on the value, "ha ha." When we run the above code, we get the following output:
Find using \N
No find using $N
1-ha
To break down what is happening, here's the type of notation that you can use in the two phases of Java regular expression pattern execution:
Matching: \N
Replacing: $N
NOTE: The reason we get the "1" in the replaced string is that, in a Java replacement string, a back-slash escapes the character that follows it, so \1 is treated as the literal character "1" rather than as a back-reference. You'll also note that since we are executing Java through a ColdFusion context, we don't need to escape back-slashes in strings.
Javascript Regular Expressions
While the Javascript engine is not as robust as some of the other regular expression engines, it can do some pretty amazing stuff when it comes to string matching and replacing. If you look at my general Javascript regular expression overview, you'll see that Javascript has a number of regex functions that can work in a variety of ways. That said, let's run the same demo as above, this time in a Javascript context:
<!DOCTYPE HTML>
<html>
<head>
<title>Javascript Regular Expressions</title>
<script type="text/javascript">
// Check to see if \N works in pattern. NOTE: search() returns the
// index of the match (or -1 for no match), so compare against -1.
if ("ha ha".search( new RegExp( "(ha) \\1", "i" ) ) !== -1){
document.write( "Find using \\N" );
} else {
document.write( "No find using \\N" );
}
document.write( "<br><br>" );
// Check to see if $N works in pattern.
if ("ha ha".search( new RegExp( "(ha) $1", "i" ) ) !== -1){
document.write( "Find using $N" );
} else {
document.write( "No find using $N" );
}
document.write( "<br><br>" );
// Check to see if \N or $N works in replace.
document.write(
"ha ha".replace(
new RegExp( "(ha) (ha)", "i" ),
"\\1-$2"
)
);
</script>
</head>
<body>
<!-- Intentionally left blank. -->
</body>
</html>
Unlike the ColdFusion context, when we are working with strings in Javascript, we do have to escape the back-slash as a special character. Therefore, when we use \N notation in a Javascript string, we have to use \\N, such that when the string evaluates, our regular expression pattern is left with a proper back-reference, \N. Also note that search() returns the index of the match, or -1 when there is none; since a match at index 0 is falsy, the result has to be compared against -1 rather than used directly as the condition. When we run the above code, we get the following output:
Find using \N
No find using $N
\1-ha
To break down what is happening, here's the type of notation that you can use in the two phases of Javascript regular expression pattern execution:
Matching: \N
Replacing: $N
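The same checks can also be sketched with regex literals, which take the pattern exactly as written and so need no doubled back-slash. This is a minimal sketch rather than the demo above; the variable names are made up for illustration, and test() is used because it returns a plain boolean.

```javascript
// Regex literals take the pattern as-is, so \1 needs no extra back-slash.
const repeated = /(ha) \1/i;

// test() returns true or false, which sidesteps the fact that search()
// returns 0 (a falsy value) for a match at the start of the string.
const foundWithBackslash = repeated.test("ha ha");

// Inside a pattern, $ is the end-of-input anchor, so "$1" cannot act as a
// back-reference and this pattern does not match "ha ha".
const foundWithDollar = /(ha) $1/i.test("ha ha");

// In the replacement string, $N inserts captured group N.
const swapped = "ha HA".replace(/(ha) (ha)/i, "$2-$1");

// $$ inserts a literal dollar sign, so "$$1" yields the text "$1".
const literalDollar = "ha ha".replace(/(ha) (ha)/i, "$$1-$2");

console.log(foundWithBackslash); // true
console.log(foundWithDollar);    // false
console.log(swapped);            // HA-ha
console.log(literalDollar);      // $1-ha
```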
So there you have it - three powerful languages, and a back-reference notation that depends on both the language and the execution phase. I know these languages are all running on different RegEx engines, but I am a bit curious as to why there is no standard on how back-references work. This seems like the kind of thing that would have been nailed down after Perl (or whoever) set the standard. In any case, I hope this helps. If you are a .NET or Ruby developer, I'd love to hear how they use back-references as well.
Reader Comments
I was always annoyed by the difference between how Homesite+ implemented RegEx backreference for find/replace and how CF does it. Why would the (admittedly old) CF IDE use a different backreference than CF itself?
@David,
I know exactly what you mean. I happen to love HomeSite. In fact, HomeSite is where I learned RegEx for the first time, using the Find/Replace to clean data exports from clients. HomeSite has allll kinds of differences. It's like a sub-set of the POSIX functionality. Very frustrating when simple things like (\r\n) don't work.
Homesite+ was also my introduction to RegEx. Back then, the "extended" find/replace feature made it easy to include line breaks, even in your RegEx--as long as you put them in as literals! That's sort of contrary to RegEx and probably stunted my growth/understanding of RegEx overall.
I'm generally pleased with RegEx support in eclipse/cfEclipse find/replace these days. And have switched entirely over to eclipse for all my CF, HTML, JS development. I've added the non-paid version of Aptana to Eclipse for HTML, CSS, JS but haven't begun to take advantage of Aptana's JS library recognition--haven't figured out how to tell it that I'm using jQuery or even my own libraries with a certain CF page. But the standard JS intellisense, color coding and code formatting alone are enough to abandon Homesite.
@David,
The extended find/replace definitely made line breaks easier! In fact, that's part of why I love the big box so much after all these years. Of course, once I started learning more about regular expressions, I wanted to just use \r\n... but no such luck. Still, it's a great feature.