I have a Firefox export file of more than 2600 bookmarks that I want to import into Buku, which seems to choke on the ICON attributes in the html file. So I want to replace every ICON reference with nothing. Here's an example, the shortest one:
ICON="data:image/png;base64,PHN2ZyB4bWxucz0naHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmcnIHdpZHRoPScxNicgaGVpZ2h0PScxNic+IDxwYXRoIGQ9J00wIDBoMTZ2MTZIMHonLz4gPHBhdGggZD0nTTEzLjk5NCAxMC4zNTZIMTVWMTJoLTMuMTcxVjcuNzQxYzAtMS4zMDgtLjQzNS0xLjgxLTEuMjktMS44MS0xLjA0IDAtMS40Ni43MzctMS40NiAxLjh2Mi42M2gxLjAwNlYxMkg2LjkxOFY3Ljc0MWMwLTEuMzA4LS40MzUtMS44MS0xLjI5MS0xLjgxLTEuMDM5IDAtMS40NTkuNzM3LTEuNDU5IDEuOHYyLjYzaDEuNDQxVjEySDF2LTEuNjQ0aDEuMDA2VjYuMDc5SDFWNC40MzVoMy4xNjh2MS4xMzlhMi41MDcgMi41MDcgMCAwIDEgMi4zLTEuMjlBMi40NTIgMi40NTIgMCAwIDEgOC45MzEgNS45MSAyLjUzNSAyLjUzNSAwIDAgMSAxMS40IDQuMjg0IDIuNDQ4IDIuNDQ4IDAgMCAxIDE0IDYuOXYzLjQ1OHonIGZpbGw9JyNmZmYnLz4gPC9zdmc+"
I've tried
sed -e 's/^ICON=\"data:image\/png;base64,^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$\"$//g' firefox_bookmarks_copie.html > test1.html
sed -e 's/^ICON="[[:print:]]"$//gi' firefox_bookmarks_copie.html > test2.html
sed -e 's/^ICON="data:image(\/[^;]+;base64[^"]+)"$//g' firefox_bookmarks_copie.html > test3.html
awk '{gsub(/^ICON="[:print:]"$/,"");}' firefox_bookmarks_copie.html > copie4.html
AWK seems to cause me problems when saving in copie4.html
perl -0pe 's/^ICON="data:image(\/[^;]+;base64[^"]+)"$//' firefox_bookmarks_copie.html >> copie5.html
The https://regex101.com/r/sxFswz/1 site seems to tell me that my substitution regex works with
/ICON="data:image(\/[^;]+;base64[^"]+)"/g
Can you help me?
Assumptions:
the objective is to remove all ICON="..." strings from the html file
Using the following (heavily) modified sample html file for demo purposes:
$ cat bm.html
<!DOCTYPE html>
<html>
<head>
... some other stuff ...
</head>
<body>
... some other stuff ...
<DT><A HREF="https://www.inter.net/search/results/content/?abc" ICON="data:image/png;base64,PHN2ZyB...snip_#1...z4gPC9zdmc+">some description</A>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
<DT><A HREF="https://www.inter.net/search/results/content/?abc" ICON="data:image/png;base64,PHN2ZyB...snip_#2...z4gPC9zdmc+">some description</A>
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
</body>
</html>
NOTE: the ^^^^^^^^^^^^^^ lines do not exist in bm.html but are added here to highlight the strings we're looking for
General approach - look for the consecutive strings a) ICON=", b) [^"]* (a string that contains no double quotes) and c) "
One sed idea:
$ sed 's/ICON="[^"]*"//g' bm.html
<!DOCTYPE html>
<html>
<head>
... some other stuff ...
</head>
<body>
... some other stuff ...
<DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
<DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
</body>
</html>
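The output above keeps the space that used to sit in front of ICON=. If that stray space is unwanted, one possible variation is to let the pattern also consume any spaces before the attribute. This is only a sketch: the sample line, its example.com URL and the shortened base64 payload are made up for illustration and are not taken from the real bookmarks file.

```shell
# Hypothetical one-line sample; the base64 payload is shortened for illustration.
line='<DT><A HREF="https://example.com/" ICON="data:image/png;base64,AAAA">desc</A>'

# ' *' matches the spaces in front of ICON= so no gap is left before '>'.
cleaned=$(printf '%s\n' "$line" | sed 's/ *ICON="[^"]*"//g')

printf '%s\n' "$cleaned"
```

On a real file the same expression would be run against the file name and redirected to a new file, as in the attempts from the question.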
One awk idea:
$ awk '{gsub(/ICON="[^"]+"/,"")}1' bm.html
<!DOCTYPE html>
<html>
<head>
... some other stuff ...
</head>
<body>
... some other stuff ...
<DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
<DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
</body>
</html>
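The trailing 1 in the awk command matters: it is an always-true pattern whose default action is to print the (modified) line. The awk attempt in the question has no print action at all, which likely explains the trouble when saving to copie4.html, since awk then writes nothing. A small sketch of the difference, using a made-up sample line (the example.com URL and short payload are assumptions for the demo):

```shell
# Hypothetical one-line sample; the base64 payload is shortened for illustration.
line='<DT><A HREF="https://example.com/" ICON="data:image/png;base64,AAAA">desc</A>'

# No print action: gsub() edits $0 but awk emits nothing.
no_print=$(printf '%s\n' "$line" | awk '{gsub(/ICON="[^"]*"/,"")}')

# Trailing 1: always-true pattern, default action prints the edited line.
with_print=$(printf '%s\n' "$line" | awk '{gsub(/ICON="[^"]*"/,"")}1')

printf 'without 1: [%s]\n' "$no_print"
printf 'with 1:    [%s]\n' "$with_print"
```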
NOTE: for this particular html file the global options (sed + /g; awk + gsub() as opposed to sub()) are overkill since there's only one match per line; if linefeeds were to be removed (thus leaving a single long line of data) the global options ensure all ICON="..." matches are replaced within a single line
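To illustrate the point about the global flag, here is a minimal sketch with two ICON attributes on a single line. The line is invented for the demo (short HREF and ICON values), not taken from the real export.

```shell
# Hypothetical line holding two bookmarks, as if the linefeeds had been removed.
line='<A HREF="a" ICON="x">one</A><A HREF="b" ICON="y">two</A>'

# Without /g only the first match on the line is replaced.
first_only=$(printf '%s\n' "$line" | sed 's/ICON="[^"]*"//')

# With /g every match on the line is replaced.
all=$(printf '%s\n' "$line" | sed 's/ICON="[^"]*"//g')

printf '%s\n' "$first_only"
printf '%s\n' "$all"
```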