I use BBEdit, which supports multi-file search and replace with grep patterns. Using this pattern (copied from a Notepad++ post here on Stack Overflow):
(\bhref="|(?!^)\G)[^"<_]*\K_
I can get a list of all URLs containing underscores. The idea is to replace all the underscores with dashes. That part is no problem: the BBEdit search panel has a 'Replace with' field (like Notepad++).
All's fine, but I don't actually want to process every URL. For example, file download URLs should remain as they are, especially URLs with the .exe, .zip, .sit and .dmg extensions. The URLs I want to process are the .php and .html ones.
I mean this type of URL should be found:
<a href="software/internet-tools/ftp-disk_sheet_us.php">
but not this one:
<a href="software/internet-tools/ftp-disk_us_setup.exe">
I have tried to edit the regex above, unsuccessfully so far, and since I have to process around 30,000 URLs in 600 files, I would really like to be sure I don't do anything wrong.
Thanks a lot in advance for helping me out with that.
You may force the match only when the link ends with .html/.htm or .php:
(?:\G(?!^)|\bhref="(?=[^"]*\.(?:html?|php)"))[^"<_]*\K_
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
The (?=[^"]*\.(?:html?|php)") positive lookahead, placed immediately after href=", requires any 0+ chars other than " and then a . followed with htm/html or php and the closing "; otherwise, no match will be found.
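To see the gate on its own, here is a small Python sketch that applies only the href=" plus lookahead part of the pattern to the two sample links from the question. The names GATE, samples and is_eligible are made up for illustration, and Python stands in for BBEdit here only because its standard re module also supports lookaheads.

```python
import re

# Only the gating part of the pattern: href=" followed (lookahead) by a value
# that ends in .htm, .html or .php right before the closing double quote.
GATE = re.compile(r'\bhref="(?=[^"]*\.(?:html?|php)")')

# The two sample links from the question.
samples = [
    '<a href="software/internet-tools/ftp-disk_sheet_us.php">',
    '<a href="software/internet-tools/ftp-disk_us_setup.exe">',
]


def is_eligible(tag: str) -> bool:
    """Return True when the tag contains an href that passes the lookahead."""
    return GATE.search(tag) is not None


for tag in samples:
    print(is_eligible(tag), tag)
```

The .php link passes the gate and the .exe link does not, so only underscores in the first one would ever be reached by the rest of the pattern.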
Details
(?:\G(?!^)|\bhref="(?=[^"]*\.(?:html?|php)")) - end of the previous match (\G(?!^)) or (|):
    \bhref=" - a whole word href followed with ="
    (?=[^"]*\.(?:html?|php)") - a positive lookahead that requires the following sequence of patterns to match immediately to the right of the current location:
        [^"]* - 0+ chars other than "
        \. - a dot
        (?:html?|php) - a non-capturing group matching either htm and then an optional l, or php
        " - a double quotation mark
[^"<_]* - any 0+ chars other than ", < and _
\K - match reset operator that discards all text matched so far
_ - an underscore.
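If you want to cross-check the result outside BBEdit before touching 600 files, here is a minimal Python sketch. Note that it is a swapped-in technique, not the regex above: Python's standard re module does not support \G or \K, so the sketch matches each whole href value ending in .htm, .html or .php and replaces the underscores in a callback. The names HREF and dash_page_links are hypothetical, and the < guard from [^"<_]* is not reproduced.

```python
import re

# Capture the href=" prefix, the URL value, and the closing quote separately.
# The value must end in .htm, .html or .php, like the lookahead in the answer.
# Simplification (assumption): unlike [^"<_]*, this does not stop at a "<".
HREF = re.compile(r'(\bhref=")([^"]*\.(?:html?|php))(")')


def dash_page_links(html: str) -> str:
    """Replace underscores with dashes inside qualifying href values only."""

    def fix(match: re.Match) -> str:
        prefix, url, quote = match.groups()
        return prefix + url.replace("_", "-") + quote

    return HREF.sub(fix, html)


if __name__ == "__main__":
    print(dash_page_links('<a href="software/internet-tools/ftp-disk_sheet_us.php">'))
    print(dash_page_links('<a href="software/internet-tools/ftp-disk_us_setup.exe">'))
```

Running something like this on a copy of a few files and diffing the output against what BBEdit produces is one way to gain confidence before the bulk replace.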