
Substitute the ICON reference for nothing


I have a Firefox export file with more than 2600 bookmarks that I want to import into Buku, which seems to choke on the ICON attribute in the HTML file. So I want to replace every ICON reference with nothing. Here's an example, the shortest one:

ICON=""

I've tried

sed -e 's/^ICON=\"data:image\/png;base64,^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$\"$//g' firefox_bookmarks_copie.html > test1.html

sed -e 's/^ICON="[[:print:]]"$//gi' firefox_bookmarks_copie.html > test2.html

sed -e 's/^ICON="data:image(\/[^;]+;base64[^"]+)"$//g' firefox_bookmarks_copie.html > test3.html

awk '{gsub(/^ICON="[:print:]"$/,"");}' firefox_bookmarks_copie.html > copie4.html

awk seems to cause problems when saving to copie4.html

perl -0pe 's/^ICON="data:image(\/[^;]+;base64[^"]+)"$//' firefox_bookmarks_copie.html >> copie5.html

The https://regex101.com/r/sxFswz/1 site seems to be telling me that my substitution regex works with

/ICON="data:image(\/[^;]+;base64[^"]+)"/g

Can you help me?


Solution

  • Assumptions:

    • OP wants to remove ALL ICON="..." strings from the HTML file

    Using the following (heavily) modified sample html file for demo purposes:

    $ cat bm.html
    <!DOCTYPE html>
    <html>
      <head>
    ... some other stuff ...
      </head>
      <body>
    ... some other stuff ...
          <DT><A HREF="https://www.inter.net/search/results/content/?abc" ICON="...snip_#1...z4gPC9zdmc+">some description</A>
                                                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
          <DT><A HREF="https://www.inter.net/search/results/content/?abc" ICON="...snip_#2...z4gPC9zdmc+">some description</A>
                                                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
      </body>
    </html>
    

    NOTE: the ^^^^^^^^^^^^^^ lines do not exist in bm.html but are added here to highlight the strings we're looking for

    General approach: match three consecutive pieces: a) ICON=", b) [^"]* (any run of characters that are not double quotes, possibly empty) and c) the closing "

    One sed idea:

    $ sed 's/ICON="[^"]*"//g' bm.html
    <!DOCTYPE html>
    <html>
      <head>
    ... some other stuff ...
      </head>
      <body>
    ... some other stuff ...
          <DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
          <DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
      </body>
    </html>
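
The attempts in the question anchor the pattern with ^ and $, so they can only match when ICON="..." is the entire line, whereas in a bookmarks export the attribute sits in the middle of an <A ...> tag. A minimal sketch of the difference, assuming a made-up sample line (the URL and the base64 payload are placeholders, not taken from the real file):

```shell
# One sample line in the shape of a bookmark entry (placeholder URL and payload).
line='<DT><A HREF="https://example.com/" ICON="data:image/png;base64,AAAA">desc</A>'

# Anchored: ^ and $ require the whole line to be the ICON attribute, so nothing changes.
anchored=$(printf '%s\n' "$line" | sed 's/^ICON="[^"]*"$//g')

# Unanchored: matches the attribute wherever it appears in the line.
unanchored=$(printf '%s\n' "$line" | sed 's/ICON="[^"]*"//g')

printf '%s\n' "$anchored" "$unanchored"
```

To keep the result, redirect the output to a new file the same way the question does, e.g. sed 's/ICON="[^"]*"//g' firefox_bookmarks_copie.html > test1.html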
    

    One awk idea:

    $ awk '{gsub(/ICON="[^"]*"/,"")}1' bm.html
    <!DOCTYPE html>
    <html>
      <head>
    ... some other stuff ...
      </head>
      <body>
    ... some other stuff ...
          <DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
          <DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
      </body>
    </html>
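
A likely reason the awk attempt in the question left copie4.html empty: its script only calls gsub() and never prints. gsub() edits the current record, but awk writes nothing unless there is a print or a pattern with the default action, which is what the trailing 1 provides. A small sketch, again assuming a made-up sample line:

```shell
# Placeholder bookmark line; URL and payload are invented for the demo.
line='<DT><A HREF="https://example.com/" ICON="data:image/png;base64,AAAA">desc</A>'

# No print action: gsub() modifies $0, but awk writes nothing.
silent=$(printf '%s\n' "$line" | awk '{gsub(/ICON="[^"]*"/,"")}')

# Trailing 1 is an always-true pattern whose default action prints $0.
printed=$(printf '%s\n' "$line" | awk '{gsub(/ICON="[^"]*"/,"")}1')

printf 'without 1: [%s]\nwith 1:    [%s]\n' "$silent" "$printed"
```

Note also that a character class such as [:print:] only has its special meaning inside a bracket expression, i.e. written as [[:print:]].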
    

    NOTE: for this particular HTML file the global options (sed's /g flag; awk's gsub() as opposed to sub()) are overkill since there's only one match per line; if the linefeeds were removed (leaving a single long line of data), the global options would ensure that all ICON="..." matches within that line are replaced
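
To illustrate the note on the global options, here is a small sketch with two entries joined on one line (the HREF and ICON values are short placeholders):

```shell
# Two bookmark entries on a single line (placeholder values).
line='<A HREF="a" ICON="x">one</A><A HREF="b" ICON="y">two</A>'

# Without /g only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/ICON="[^"]*"//')

# With /g every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/ICON="[^"]*"//g')

printf '%s\n' "$first_only" "$all"
```

The first command leaves the second ICON="y" in place; the second removes both.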