Non-Greedy Regular Expression Misunderstanding

By Ben Nadel

Published 2006-09-29 in ColdFusion — Comments (17)

I have recently started to use non-greedy regular expressions. Case in point, I am using non-greedy regular expression searches to match link HREFs. Something is going wrong though. It is matching very large strings... I must be misunderstanding how non-greedy searches run. Uggg. Gotta go back to the RegEx drawing board.

For instance, I had the non-greedy search:

(<a.+?href="?)

To me, that should match the shortest possible link, ie, the shortest matching string that starts with <a and ends with href=". But it's not! It matching waaaay too much. I had to replace it with:

(<a[">]+?href="?)

This works, but does not feel as nice.

Short link: https://bennadel.com/302

Reader Comments

Rick O Sep 29, 2006 at 11:37 PM

153 Comments

I learned very early on to avoid the dot regex operator as much as possible. I started learning about regexes when I printed out all of the Perl man pages to take with me on my vacation to Iceland 10 years ago (geek!). I had no computer, but was determined to be able to use Perl by the time I got back. The perlre man page was ginormous, and I still to this day remember reading about the various meanings of the dot operator depending on the switches. It meant one thing if the single-line switch was set, another for multi-line, another thing for greedy, and another for non-greedy, and I think yet another for look-ahead and look-back assertions. There were copious warning about not changing your regex switches once you had it working, and several of them mentioned the dot operator.

In short, I learned to avoid it like the plague.

For your specific example, why not use a whitespace class, like "\s" or "[[:space:]]"?

Ben Nadel Sep 30, 2006 at 10:16 AM

16,125 Comments

Rick,

Yeah, I hear ya. When I really stepped through the regular expression, I started to understand why it was matching so much. I had too much faith in the non-greedy expression. I mean, it was doing it exactly right, but I wanted it to read my brain a little more :)

I am cleaning it up now and it should be working fine. Non-greedy searching is pretty awesome though.

And as far as dot stuff... I am actually using this inside of a Java Pattern / Matcher object and I have flagged the DOTALL so . should match any character including the new line.

Hey, nothing wrong with printing out a computer manual and taking it with you... why do you tihnk I have a $40 industrial strength hole puncher ;)

Steven Levithan Feb 2, 2007 at 12:38 AM

172 Comments

FYI, the term for "non-greedy" is lazy. There are three types of regex quantifiers: greedy, lazy, and possessive. Unfortunately, both ColdFusion and JavaScript only support the first two.

I've heard from an Adobe rep that CF8 will be beefing up CF's regex support. I have no idea what they'll add, though. Hopefully lookbehinds (and variable-length lookbehinds, at that) will be the first thing on the list.

One software recommendation if you haven't already seen it: RegexBuddy (http://regexbuddy.com/). Trust me, it's well worth the measly $30.

Ben Nadel Feb 2, 2007 at 7:19 AM

16,125 Comments

I've never heard of a "possessive" regular expression. What does that do? Is that available in Java? RegEx Buddy looks pretty badasss. I use something called the RegEx Coach. Similar idea, but light years behind what it looks like the "Buddy" does with explanation trees of a given expression. Cool stuff.

Ben Nadel Feb 2, 2007 at 7:22 AM

16,125 Comments

Also, I just tried looking up possessive and on about.com their term for "Lazy" is "Reluctant." I think as long as we all know what we are talking about, it's all good.... but ok, back to finding out what possessive is.

Ben Nadel Feb 2, 2007 at 7:35 AM

16,125 Comments

Hey Steve, About.com says this:

----
"The third type of match is the possessive match. A possessive match is one that compares the regular expression to the entire text string and only matches when the whole string satisfies the criteria. A regular expression is converted to match possessively by appending a + symbol to the end of the text. Actually there is little point in defining possessive qualifiers in regular expressions at the present time as both Internet Explorer 6 and Opera 7 do not understand them at all and produce an error while the Mozilla based browsers such as Firefox simply ignore the possessive qualification and treat the expression as greedy."
----

Maybe I am really confused here, but is this any different than doing "^regex$"? It seems like its just a standard string match on the entire string.

I just found another explanation over on Sun that the "+" after a selector will match on the entire string. This seems very confusing to me as they give this as an example:

.*+foo
with input: xfooxxxxxxfoo

Which will fail because ".*+" will match on the entire string and then there is no more room left for "foo". This seems like a very silly quantifier and at best, a short hand for ^...$.

Maybe I am totally missing the mark here though, can you please lend some insight to the possessive quantifier. Thanks!

Steven Levithan Feb 2, 2007 at 3:40 PM

172 Comments

Possessive quantifiers are most frequently used to improve performance for certain types of patterns. BTW, possessive quantifiers are simply a limited form of atomic grouping with a shorter notation. Atomic groups are more powerful because you can wrap them around any portion of your pattern, not just the portions you're repeating. With atomic groups/possessive quantifiers, the grouping is treated as one token, and once the regex engine leaves the group it can't backtrack into it (though it can backtrack to tokens before it and try again from there).

This will probably be easier to understand if we break down how regex engines actually work internally. First, let's take a look at the simple (but useless) example from Sun.

Pattern: .*+foo
Input: xfooxxxxxxfoo

Before I get to that, let's take a look at what the more widely used greedy and lazy quantifiers would do.

Using greedy repetition (.*foo), the regex engine would go through the following steps when trying to match the input:

xfooxxxxxxfoo
xfooxxxxxxfoo[backtrack]
xfooxxxxxxfo
xfooxxxxxxfo[backtrack]
xfooxxxxxxf
xfooxxxxxxf[backtrack]
xfooxxxxxx
xfooxxxxxxf
xfooxxxxxxfo
xfooxxxxxxfoo
[Match found]

Note that since the regex started with ".*", the very first step was matching the entire line. That step is done very fast. However, since the regex pattern ends with the literal string "foo", the regex engine then backtracks one character at a time, trying to find the minimum number of characters it needs to give up to match "foo" (that's why they're called greedy quantifiers).

If the string we were testing against was "xfooxxxxxxfoofo", the pattern would require several more steps to test, as I'll show below:

xfooxxxxxxfoofo
xfooxxxxxxfoofo[backtrack]
xfooxxxxxxfoof
xfooxxxxxxfoof[backtrack]
xfooxxxxxxfoo
xfooxxxxxxfoof - Matched the first character in our literal string "foo"
xfooxxxxxxfoofo - Matched the second character in our literal string "foo"
xfooxxxxxxfoofo[backtrack] - Oops, not enough o's at the end, so backtrack one character further than we've backtracked before
xfooxxxxxxfo
xfooxxxxxxfo[backtrack]
xfooxxxxxxf
xfooxxxxxxf[backtrack]
xfooxxxxxx
xfooxxxxxxf
xfooxxxxxxfo
xfooxxxxxxfoo
[Match found]

Note that backtracking can easily get out of hand with certain types of patterns, especially when they're poorly written. In this case, the issue is that the "f"s and "o"s in "foo" are matched by the "." operator, which greedily repeats right past them (then backtracking occurs). However, note that sometimes this is still preferable and faster than other solutions like lazy repetition. It depends entirely on your pattern and the strings you're testing it against.

If we'd used lazy repetition (.*?foo), regex engines would only take five steps to find the first match in both "xfooxxxxxxfoo" and "xfooxxxxxxfoofo":

[backtrack]
x
xf
xfo
xfoo
[Match found]

Note that it steps through the string one character at a time until it can match the literal text "foo". In this case, lazy repetition would be faster.

Now that that's out of the way, here's how a possessive quantifier (.*+foo) would change the game when testing against "xfooxxxxxxfoo":

xfooxxxxxxfoo
xfooxxxxxxfoo[backtrack] - Since we're already at the end of the string, it tries to backtrack to find the literal text "foo", but the possessive quantifier won't give back any of its match
[Match attempt failed]

Note that while it failed a match, it did so extremely quickly. In fact, it took less steps to fail that (2) than trying to match the literal string "xfx" would have (3).

Note that there are certainly many practical uses for atomic groups beyond simply improving performance. For good examples of cases where you're fcuked if you don't have the performance-enhancing power of atomic grouping/possessive quantifiers available to you, I defer to the following excellent page which also explains all of this stuff in more depth: http://regular-expressions.info/atomic.html

Steven Levithan Feb 3, 2007 at 5:29 PM

172 Comments

Oops. Sorry about accidently not closing one of my strong tags in the previous comment. BTW, regarding your original post, it might not be obvious to some readers where the problem was coming from in your regex. First of all, there is nothing inherently wrong with the regex (<a.+?href="?) (aside from the fact that there should be a space character after "a" to prevent the pattern from dealing with, e.g., "<acronym"). The problem arises when an "a" tag does not have it's own href attribute, and "href=" appears later in the string. Replacing the dot with [^>] simply ensures that the match attempt stays within one "a" tag.

Ben Nadel Feb 3, 2007 at 7:49 PM

16,125 Comments

Steve, I feel like that scene from Wayne's World where they are back stage with Alice Cooper making small talk and then drop to the floor "We're not worthy, we're not worthy, we're not worthy." :) I thought I new a lot about regular expressions and now you come along and make me fee like an amateur. Of all the sites in all the world, you had to walk into mine.

That's awesome though, I see now that there is so much more to learn. I never really thought about optimizing regular expressions. I guess, I never really thought that much could be done about it. Thanks!

Your explanations are excellent. I did not know that it .* would read in a whole line then backtrack. That feels horribly inefficient (but required for the match).

Steven Levithan Feb 3, 2007 at 10:18 PM

172 Comments

LOL! :) And thanks, Ben.

I, in turn, feel the same way about your ColdFusion / Java skillz. I've got a ways to go, but I'm working on it.

And I've got a ways to go till mastering regular expressions, myself. In particular, now that you've shown me how to use the more powerful, underlying Java regex library in ColdFusion, I'm interested in looking at creative ways to exploit the myriad unicode character properties ( http://regular-expressions.info/unicode.html#prop ) that can be used as regex tokens.

Steven Levithan Feb 4, 2007 at 2:49 AM

172 Comments

[I never really thought about optimizing regular expressions. I guess, I never really thought that much could be done about it.]

The types of performance differences I've demonstrated on this page are trivial, and not worth fretting over. After a while, though, writing optomized regexes just comes naturally and doesn't take any extra time.

However, the implications of this stuff have the potential to be massive. I could easily write a reasonably short regex / test string pair which would trigger catastrophic backtracking, increasing the number of steps taken by the regex engine exponentially with every couple charaters added to the test string. Such a regex would take a hell of a long time to run (think hours, days, or more), and could potentially even crash the server as the regex engine runs out of memory trying to remember all backtracking positions.

SB Aug 2, 2009 at 2:07 AM

1 Comments

Hey guys, despite all of the comments in this thread, the original question has not been answered: how come non-greedy repetition quantifiers sometimes match too much? I have encountered the same issue as the original poster, although with a different regex. Does this really come down to "you can never use the dot regex operator" inside a repeated non-greedy sub-pattern? That seems kind of weird.

Ben Nadel Aug 5, 2009 at 9:25 AM

16,125 Comments

@SB,

This is an old post, so I don't remember what the original problem was. My guess was that I had anchor tags that did not have HREF attributes or something. So, for example, if I had code like this (my comments don't allow 'A' tags, so I am escaping them:

some text [a name="anchor"] foo [a href=""]link.[/a] and more.

... my first regex would have matched the entire string:

[a name="anchor"] foo [a href=""

The first anchor tag didn't have an HREF attribute, but it got matched. I had to add the [^>] character class to make sure the HREF was in the same tag as the first "A" match.

Pierre Feb 5, 2010 at 1:39 PM

1 Comments

you would not match this because of the end tag in the string

[a name="an>hor" href="TheLink"]

You need a more complex regular expression to filter the attributes and the attribute values before matching the first href attribute.

[a\s+((\w+="[^"]*|\w+=\w?)\s+)??href="]

Ben Nadel Feb 6, 2010 at 5:31 PM

16,125 Comments

@Pierre,

Yeah, definitely. I can't remember what the original situation was, but I think something else must have been going wrong. For my purposes, the lazy patter *should* have worked. And, today, I can't even replicate the problem I was having then (I just tried). But, you are correct, my simplistic pattern would have quickly failed on a number of different tag variations.

oO May 26, 2010 at 6:17 AM

1 Comments

It matches the right thing. You are mistaken in your regexp. You simply don't understand non-greediness.

You write:
({a.+?href="?)
href=" is optional, so you get:
{a.+?
And it matches everything right from {a

Specify the limiter, so that + can be non-greedy:
{a.+?}
Now it will match everything between {a and }
And it will match the shortest matching text (one tag).

When greedy, it would match } from the last tag in the text. Here:
AAA{a href=""}{/a}{b}BBB{/b}CCC
greedy expresion will match
{a href=""}{/a}{b}BBB{/b}
Non greedy - only {a href=""}

PS. Repair the comment system, as it does not allow to insert greater-then and lesser-then signs - that's why I use { and }. Or change the blogging system.

Ben Nadel Jun 8, 2010 at 9:59 AM

16,125 Comments

@oO,

I *know* that I was mistaken - that was the point of the post :) However, the second question mark does not make herf=" optional - it only makes the quote optional (as the ? is applying to a character, not a group).

As far as < >, my commenting system does not allow {A} tags specifically (due to retarded amounts of spam that use that approach for linking).

Oh my chickens, this post is old!

Hit me up on LinkedIn if you want to discuss it further.