
Replace underscores with dashes in URLs with given extensions using GREP / regex


I use BBEdit, which supports multi-file search and replace with GREP. Using this pattern (copied from a Notepad++ post here on Stack Overflow):

(\bhref="|(?!^)\G)[^"<_]*\K_

I can get a list of all URLs containing underscores. The idea is to replace all the underscores with dashes. That part is no problem: the BBEdit search panel has a 'Replace with' field (like Notepad++).

All's fine, BUT I don't actually want to process all URLs. For example, file download URLs should remain as they are, especially URLs with the .exe, .zip, .sit and .dmg extensions. The URLs I want to process are the .php and .html URLs.

I mean this type of URL should be found:

<a href="software/internet-tools/ftp-disk_sheet_us.php">

but not this one:

<a href="software/internet-tools/ftp-disk_us_setup.exe">

I have tried to edit the regex above, unsuccessfully so far, and since I have to process around 30,000 URLs in 600 files I would really like to be sure I don't do anything wrong.

Thanks a lot in advance for helping me out with that.


Solution

  • You may force the match only when the link ends with .html/.htm or .php:

    (?:\G(?!^)|\bhref="(?=[^"]*\.(?:html?|php)"))[^"<_]*\K_
                       ^^^^^^^^^^^^^^^^^^^^^^^^^
    


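    The (?=[^"]*\.(?:html?|php)") positive lookahead, placed immediately after href=", requires 0 or more chars other than ", then a . followed by htm/html or php, and then the closing ". If the attribute value does not end that way, href=" does not match, so no underscore inside that URL is matched either.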
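To see the lookahead acting as a gate on its own, here is a minimal sketch in Python. It uses only the `\bhref="(?=...)` part of the pattern, which the standard `re` module supports; the names `GATE`, `page` and `download` are just illustrative, and the two sample strings are the ones from the question.

```python
import re

# Only the gate from the pattern above: href=" followed by the lookahead.
# The lookahead checks the extension without consuming any characters.
GATE = re.compile(r'\bhref="(?=[^"]*\.(?:html?|php)")')

page = '<a href="software/internet-tools/ftp-disk_sheet_us.php">'
download = '<a href="software/internet-tools/ftp-disk_us_setup.exe">'

print(bool(GATE.search(page)))      # the .php link passes the gate
print(bool(GATE.search(download)))  # the .exe link does not
```

Because the lookahead is zero-width, the match itself is still just `href="`; the rest of the pattern then continues from right after the opening quote.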

    Details

    • (?:\G(?!^)|\bhref="(?=[^"]*\.(?:html?|php)")) - a non-capturing group matching either the end of the previous match (\G(?!^)) or (|) the following:
      • \bhref=" - a whole word href followed with ="
      • (?=[^"]*\.(?:html?|php)") - a positive lookahead that requires the following sequences of patterns to match immediately to the right of the current location:
        • [^"]* - 0+ chars other than "
        • \. - a dot
        • (?:html?|php) - a non-capturing group matching either htm followed by an optional l, or php
        • " - a double quotation mark
    • [^"<_]* - any 0+ chars other than ", < and _
    • \K - match reset operator that discards all text matched so far from the reported match, so only the underscore is replaced
    • _ - an underscore.
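Since the goal is to be sure nothing goes wrong across 600 files, it may help to cross-check the result outside BBEdit. The sketch below is not the regex above: Python's standard `re` module has no `\G` or `\K`, so it swaps in a different technique, matching each whole `href` value that ends in .htm, .html or .php and replacing the underscores inside it with a callback. The names `HREF_PAGE` and `dash_page_urls` are hypothetical, and edge cases (for example a `<` inside an attribute value) may behave differently from the BBEdit pattern.

```python
import re

# Matches href="..." values that end in .htm, .html or .php.
# Group 1: the href=" prefix, group 2: the URL, group 3: the closing quote.
HREF_PAGE = re.compile(r'(\bhref=")([^"<]*\.(?:html?|php))(")')


def dash_page_urls(html: str) -> str:
    """Replace underscores with dashes only inside .htm/.html/.php href values."""
    return HREF_PAGE.sub(
        lambda m: m.group(1) + m.group(2).replace("_", "-") + m.group(3),
        html,
    )


sample = (
    '<a href="software/internet-tools/ftp-disk_sheet_us.php">\n'
    '<a href="software/internet-tools/ftp-disk_us_setup.exe">'
)
print(dash_page_urls(sample))
```

Running this on a copy of a few files and comparing the output with what BBEdit produces is one way to gain confidence before processing everything.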