html regex vbscript regex-greedy non-greedy

RegEx HTML matching too much with lazy wildcard


RegEx:

<span style='.+?'>TheTextToFind</span>

HTML:

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;'>TheTextToFind</span></span>

Why does the match include this?

<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED



Solution

  • The regex engine always finds the left-most match, and a lazy quantifier only affects where that match ends, not where it starts. That's why you get

    <span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;'>TheTextToFind</span>
    

    as a match. (Basically the whole input, sans the last </span>).

    To steer the engine in the right direction, assume that > never appears inside the attribute value. Under that assumption, the following regex matches what you want:

    <span style='[^>]+'>TheTextToFind</span>
    

    It works because, with the above assumption, [^>]+ can't match past the end of a tag, so the match can only start at the <span> that directly contains TheTextToFind.

    However, I hope that you are not doing this as part of a program that extracts information from an HTML page. Use an HTML parser for that purpose.


    To understand why the regex matches as it does, you need to understand that .+? will backtrack and extend itself, one character at a time, until it can find a match for the sequel ('>TheTextToFind</span>).

    # Matching .+?
    # Since +? is lazy, it matches . once (to fulfill the minimum repetition), and
    # increases the number of repetitions whenever the sequel fails to match
    <span style='f                        # FAIL. Can't match closing '
    <span style='fo                       # FAIL. Can't match closing '
    ...
    <span style='font-size:11.0pt;        # PROCEED. But FAIL later, since can't match T in The
    <span style='font-size:11.0pt;'       # FAIL. Can't match closing '
    ...
    <span style='font-size:11.0pt;'>DON   # PROCEED. But FAIL later, since can't match closing >
    ...
    <span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style=
                                          # PROCEED. But FAIL later, since can't match closing >
    ...
    <span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;
                                          # PROCEED. MATCH FOUND.
    

    As you can see, .+? tries increasingly long matches and ends up consuming font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED <span style='font-size:18.0pt;, which is the shortest stretch after which the sequel '>TheTextToFind</span> can be matched.
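
The behaviour described above can be reproduced with a short script. This is only a sketch: it uses Python's re module as a stand-in for the VBScript engine (an assumption; both are backtracking engines, so the lazy quantifier should behave the same way), and the names lazy, strict, first_fit and SpanFinder are made up for illustration. It shows the over-long lazy match, the [^>]+ fix, a manual replay of the "increase the length until the sequel fits" loop, and, following the parser advice, one possible way to read the style with the standard library's html.parser.

```python
import re
from html.parser import HTMLParser

html = ("<span style='font-size:11.0pt;'>DON'T_WANT_THIS_MATCHED "
        "<span style='font-size:18.0pt;'>TheTextToFind</span></span>")

# The original pattern (lazy wildcard) and the negated-class fix.
lazy = re.compile(r"<span style='.+?'>TheTextToFind</span>")
strict = re.compile(r"<span style='[^>]+'>TheTextToFind</span>")

lazy_match = lazy.search(html)
strict_match = strict.search(html)
print(lazy_match.start())    # 0: the match starts at the outer <span>
print(strict_match.group())  # only the inner <span>...</span>

# What .+? ended up consuming, made visible with a capturing group.
captured = re.search(r"<span style='(.+?)'>TheTextToFind</span>", html).group(1)

# Manual replay of the lazy loop: try 1, 2, 3, ... characters and stop at
# the first length after which the literal sequel matches.
prefix = "<span style='"
sequel = "'>TheTextToFind</span>"
rest = html[len(prefix):]
first_fit = next(n for n in range(1, len(rest)) if rest.startswith(sequel, n))
print(first_fit == len(captured))  # True


class SpanFinder(HTMLParser):
    """Collects the style of each <span> whose own text equals `wanted`."""

    def __init__(self, wanted):
        super().__init__()
        self.wanted = wanted
        self.open_spans = []  # stack of attribute dicts for open <span> tags
        self.styles = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.open_spans.append(dict(attrs))

    def handle_endtag(self, tag):
        if tag == "span" and self.open_spans:
            self.open_spans.pop()

    def handle_data(self, data):
        # Text is attributed to the innermost open <span>.
        if self.open_spans and data.strip() == self.wanted:
            self.styles.append(self.open_spans[-1].get("style"))


finder = SpanFinder("TheTextToFind")
finder.feed(html)
finder.close()
print(finder.styles)  # ['font-size:18.0pt;']
```

The parser version does not depend on how the attribute is quoted or on which characters appear inside it, which is the main reason to prefer it over either regex when the input is real HTML.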