Javascript replace() regular expression too greedy


I am trying to sanitize an HTML input field. I want to keep some of the tags, but not all of them, so I can't just use .text() when reading the element value. I am having a bit of trouble with a regular expression in JavaScript in Safari. Here's the snippet of code (I copied this bit of regex from another SO thread answer):

aString.replace (/<\s*a.*href=\"(.*?)\".*>(.*?)<\/a>/gi, '$2 (Link->$1)' ) ;

Here is the sample input that is failing:

<a href="http://blar.pirates.net/black/ship.html">Go here please.</a></p><p class="p1"><a href="http://blar.pirates.net/black/ship.html">http://blar.pirates.net/black/ship.html</a></p>

The idea is that the href will get pulled out and output as plain text next to the text that would have been linked. So the above output should ultimately be something like:

Go here please (Link->http://blar.pirates.net/black/ship.html)
http://blar.pirates.net/black/ship.html (Link->http://blar.pirates.net/black/ship.html)

However, the regex is grabbing all the way down to the second </a> tag on the first match, so I am losing the first line of output. (Actually, it will grab as far down the list as long as the anchor elements are adjacent.) The input is one long string, not split over lines with a CR/LF or anything.

I have tried using a non-greedy flag like this (note the 2nd question mark):

/<\s*a.*href=\"(.*?)\".*?>(.*?)<\/a>/ig

But that didn't seem to change anything (at least not in the few tester/parsers I tried, like https://regex101.com/r/yhmT8w/1). Have also tried the /U flag but that didn't help (or these parsers didn't recognize it).

Any suggestions?


Solution

  • The pattern contains several mistakes, and a few things can be improved:

    /<
    \s*    #  not needed (browsers don't recognize "< a" as an "a" tag)
    
    a      #  To avoid confusing an "a" tag with the start of another tag such as
           # "abbr", add a word boundary or, better, a "\s+", since at least one
           # whitespace character follows the tag name when there is an attribute.
    
    .      #  The dot matches everything except newlines, so the pattern fails if an
           # "a" tag spans several lines. The "dotall" flag (/s) only exists in
           # engines that support ES2018; a portable alternative is `[\s\S]`, which
           # matches any character (everything that is whitespace + everything
           # that is not).
    
    *      #  Quantifiers are greedy by default. ".*" matches up to the end of the
           # line, and "[\s\S]*" up to the end of the string! The regex engine then
           # has to backtrack a lot until it finds the last "href" (which is not
           # always the one you want; here it is the one of the second link).
           # Use a lazy quantifier instead: ".*?" or "[\s\S]*?".
    
    href=  # You can add a word boundary before the "h" and allow optional spaces around
           # the equals sign to make your pattern more robust: \bhref\s*=\s*
    
    \"     #  The quote doesn't need to be escaped. Also, as Markasoftware points out,
           # an attribute value is not always between double quotes: you can have
           # single quotes or no quotes at all. (1)
    (.*?)
    \"     # same thing
    .*     # greedy again: matches everything up to the last ">"
    >(.*?)<\/a>/gi
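
To see the greediness described in the comments above, here is a minimal sketch that counts how many links each variant finds in the sample input from the question. The helper `countMatches` and the variable names are made up for this illustration, and the unnecessary backslashes before the quotes are dropped.

```javascript
// Sample input from the question: two adjacent links in one single-line string.
const input =
  '<a href="http://blar.pirates.net/black/ship.html">Go here please.</a></p>' +
  '<p class="p1"><a href="http://blar.pirates.net/black/ship.html">' +
  'http://blar.pirates.net/black/ship.html</a></p>';

// Hypothetical helper: how many separate matches does a pattern find?
const countMatches = (pattern) => [...input.matchAll(pattern)].length;

// Original pattern: the first ".*" is greedy and runs to the last "href".
const original = countMatches(/<\s*a.*href="(.*?)".*>(.*?)<\/a>/gi);

// Second attempt: only the ".*" before ">" was made lazy, so nothing changes.
const secondAttempt = countMatches(/<\s*a.*href="(.*?)".*?>(.*?)<\/a>/gi);

// Making the first ".*" lazy as well lets each link match on its own.
const allLazy = countMatches(/<\s*a.*?href="(.*?)".*?>(.*?)<\/a>/gi);

console.log(original, secondAttempt, allLazy); // 1 1 2
```

The first two patterns swallow both links in a single match, which is why the first line of output disappears.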
    

    (1) -> About the quotes and the href attribute value:

    To deal with single, double or no quotes you can use a capturing group and a backreference:

    \bhref\s*=\s*(["']?)([^"'\s>]*)\1
    

    details:

    \bhref\s*=\s*
    (["']?)     # capture group 1: can contain a single, a double quote or nothing 
    ([^"'\s>]*) # capture group 2: everything that is not a quote (to stop before the
                # possible closing quote), not whitespace and not ">" (to stop at the
                # first space or before the end of the tag when no quotes are used).
                # URLs don't contain spaces, but javascript code in an href can.
    \1          # backreference to the capture group 1
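
As a quick check of this subpattern, here is a small sketch that runs it against the three quoting styles. The helper name `extractHref` and the example.com URLs are invented for the illustration.

```javascript
// Subpattern from above: group 1 is the optional quote, group 2 the href value.
const hrefPattern = /\bhref\s*=\s*(["']?)([^"'\s>]*)\1/i;

// Hypothetical helper: return the href value of a tag, or null if there is none.
function extractHref(tag) {
  const m = hrefPattern.exec(tag);
  return m ? m[2] : null;
}

const doubleQuoted = extractHref('<a href="http://example.com/a">');
const singleQuoted = extractHref("<a href='http://example.com/b'>");
const unquoted = extractHref('<a href=http://example.com/c>');

console.log(doubleQuoted); // http://example.com/a
console.log(singleQuoted); // http://example.com/b
console.log(unquoted);     // http://example.com/c
```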
    

    Note that if you use this subpattern, you add a capturing group: the quote is in capture group 1, the URL is now in capture group 2, and the content between the "a" tags is in capture group 3. Remember to change $1 to $2 and $2 to $3 in your replacement string.

    In the end, you can write your pattern like this:

    aString.replace(/<a\s+[\s\S]*?\bhref\s*=\s*(["']?)([^"'\s>]*)\1[^>]*>([\s\S]*?)<\/a>/gi,
                   '$3 (Link->$2)');