Comment out a whole hyperlink block with sed in an HTML file


I'd like to comment out certain hyperlinks, all of which contain "legacy/" in the URL, across many HTML files. Some of them are on one line

<a href=".../legacy/..."> ... </a>\n

while others span several lines. How can I use sed to replace them all in one pass?

So far I've tried

sed -ri 's/(.+legacy\/[[:print:]]+<\/a>.*$)/<!--\1-->/g' wave-on-a-string.html 

which only replaces a hyperlink that sits on a single line. I then realized that sed reads only one line at a time. However, I couldn't find out how to match a hyperlink block that spans multiple lines (an unknown number of them).

The HTML files have some contents like this:

      <a class="other-sim-page" href="legacy/wave-on-a-string.html" dir="ltr">
        <table>
          <tr>
            <td>
              <img style="display: block;" src="../../images/icons/sim-badges/flash-badge.png" alt="Flash Logo" width="44" height="44">
            </td>
            <td>
              <span class="other-sim-link">原始模擬教學與翻譯</span>
            </td>
          </tr>
        </table>
      </a>

...

          <p>瀏覽<a href="legacy/wave-on-a-string.html#for-teachers-header">更多活動</a>。</p>

...

                    <a href="legacy/radiating-charge.html" class="simulation-link">

                      <img class="simulation-list-thumbnail" src="../../sims/radiating-charge/radiating-charge-128.png" id="simulation-display-thumbnail-radiating-charge" alt="Screenshot of the simulation 電荷輻射" width="128" height="84"/><br/>
                        <strong><span class="simulation-list-title">電荷輻射</span></strong><br/>
                        <span class="sim-display-badge sim-badge-flash"></span>
                    </a>

...

My command matches and replaces only the second hyperlink, since that one is on a single line.

I'd like to replace all such hyperlink blocks (<a href="..."> ... </a>), including those that stretch over several lines.


Solution

  • With GNU sed (for -z, which makes sed treat the whole file as a single record as long as it contains no NUL bytes), and using the 3 blocks of input you provided together in one file:

    $ sed -z '
        s:@:@A:g; s:}:@B:g; s:</a>:}:g;
        s:<a[^<>]* href="legacy/[^}]*}:<!--&-->:g;
        s:}:</a>:g; s:@B:}:g; s:@A:@:g
    ' file
          <!--<a class="other-sim-page" href="legacy/wave-on-a-string.html" dir="ltr">
            <table>
              <tr>
                <td>
                  <img style="display: block;" src="../../images/icons/sim-badges/flash-badge.png" alt="Flash Logo" width="44" height="44">
                </td>
                <td>
                  <span class="other-sim-link">原始模擬教學與翻譯</span>
                </td>
              </tr>
            </table>
          </a>-->
    
    ...
    
              <p>瀏覽<!--<a href="legacy/wave-on-a-string.html#for-teachers-header">更多活動</a>-->。</p>
    
    ...
    
                        <!--<a href="legacy/radiating-charge.html" class="simulation-link">
    
                          <img class="simulation-list-thumbnail" src="../../sims/radiating-charge/radiating-charge-128.png" id="simulation-display-thumbnail-radiating-charge" alt="Screenshot of the simulation 電荷輻射" width="128" height="84"/><br/>
                            <strong><span class="simulation-list-title">電荷輻射</span></strong><br/>
                            <span class="sim-display-badge sim-badge-flash"></span>
                        </a>-->
    

    The first line frees up } as a character that can't be present in the input: it converts every @ to @A (so that no stray @B can exist), then every } to @B, and then every </a> to }. That lets the regexp for the string you want to replace use the negated bracket expression [^}] to mean "anything up to the next </a>". The second line does the actual replacement you want. The third line undoes the first in reverse order: it restores every } to </a>, then every @B to }, then every @A to @.

    Manipulating the input to create a char that can't exist in the input is a fairly common sed idiom to work around not being able to negate strings in regexps. See https://stackoverflow.com/a/35708616/1745001 for another example with additional explanation.
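To see the placeholder idiom at work, here is a minimal sketch on a tiny input. The sample text is made up for illustration; it deliberately contains a literal } and a literal @ so you can check that both survive the round trip. It assumes GNU sed, since -z is not available in other seds.

```shell
# Hypothetical sample: one legacy link spanning several lines, whose body
# happens to contain a literal "}" and a literal "@".
sample='x <a href="legacy/a.html">
  A}@
</a> y'

# Step 1 only: escape "@" and "}", then let "}" stand in for "</a>".
step1=$(printf '%s\n' "$sample" | sed -z 's:@:@A:g; s:}:@B:g; s:</a>:}:g')
printf '%s\n' "$step1"
# x <a href="legacy/a.html">
#   A@B@A
# } y

# All three steps: protect, wrap the block in a comment, restore.
result=$(printf '%s\n' "$sample" | sed -z '
    s:@:@A:g; s:}:@B:g; s:</a>:}:g;
    s:<a[^<>]* href="legacy/[^}]*}:<!--&-->:g;
    s:}:</a>:g; s:@B:}:g; s:@A:@:g
')
printf '%s\n' "$result"
# x <!--<a href="legacy/a.html">
#   A}@
# </a>--> y
```

After step 1 the only } left in the text is the stand-in for </a>, which is why [^}]* stops at the first closing tag instead of running on to the last one in the file.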

    This will of course fail if you have strings in your input that resemble the strings you're trying to match (for example, a matching <a ...> tag that already sits inside an HTML comment), but in practice it's probably good enough for your specific input. You'll just have to think about what it does and check its output to verify.
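One way to do that check, sketched under some assumptions: write the result to a separate file first, review the diff, and compare how many legacy links exist with how many ended up inside a comment. The directory, file names and sample HTML below are hypothetical, and GNU sed is again assumed for -z.

```shell
# Hypothetical scratch directory and sample page, for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/page.html" <<'EOF'
<p><a href="legacy/a.html">old</a> and <a href="current/b.html">new</a></p>
<a class="x" href="legacy/c.html">
  <span>old</span>
</a>
EOF

# Write the result to a new file instead of editing in place.
sed -z '
    s:@:@A:g; s:}:@B:g; s:</a>:}:g;
    s:<a[^<>]* href="legacy/[^}]*}:<!--&-->:g;
    s:}:</a>:g; s:@B:}:g; s:@A:@:g
' "$workdir/page.html" > "$workdir/page.new.html"

# Review every change by eye (diff exits non-zero when the files differ).
diff -u "$workdir/page.html" "$workdir/page.new.html" || true

# Compare the number of legacy links with the number that got wrapped.
legacy_total=$(grep -o 'href="legacy/' "$workdir/page.new.html" | wc -l)
legacy_wrapped=$(grep -o '<!--<a[^<>]* href="legacy/' "$workdir/page.new.html" | wc -l)
echo "legacy links: $legacy_total, commented out: $legacy_wrapped"

# When finished: rm -rf "$workdir"
```

If the two counts differ, some link was written in a form the regexp doesn't expect. Once the diff looks right, the same script can be applied to the real files in place with GNU sed's -i option; giving it a suffix, as in -i.bak, keeps a backup copy of each original.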