I have an export file of more than 2600 bookmarks from Firefox that I want to import into Buku, which seems to choke on the ICON attribute in the html file. So I want to replace each ICON reference with nothing. Here's an example, the shortest one:
I've tried
sed -e 's/^ICON=\"data:image\/png;base64,^(?:[A-Za-z0-9+/]{4})*(?:[A-Za-z0-9+/][AQgw]==|[A-Za-z0-9+/]{2}[AEIMQUYcgkosw048]=)?$\"$//g' firefox_bookmarks_copie.html > test1.html
sed -e 's/^ICON="[[:print:]]"$//gi' firefox_bookmarks_copie.html > test2.html
sed -e 's/^ICON="data:image(\/[^;]+;base64[^"]+)"$//g' firefox_bookmarks_copie.html > test3.html
awk '{gsub(/^ICON="[:print:]"$/,"");}' firefox_bookmarks_copie.html > copie4.html
AWK seems to cause me problems when saving in copie4.html
perl -0pe 's/^ICON="data:image(\/[^;]+;base64[^"]+)"$//' firefox_bookmarks_copie.html >> copie5.html
The https://regex101.com/r/sxFswz/1 site seems to be telling me that my substitution regex is effective with strings from the html file.
Can you help me?
Using the following (heavily) modified sample html file for demo purposes:
$ cat bm.html
<!DOCTYPE html>
... some other stuff ...
... some other stuff ...
<DT><A HREF="https://www.inter.net/search/results/content/?abc" ICON="...snip_#1...z4gPC9zdmc+">some description</A>
<DT><A HREF="https://www.inter.net/search/results/content/?abc" ICON="...snip_#2...z4gPC9zdmc+">some description</A>
NOTE: the ^^^^^^^^^^^^^^ lines do not exist in bm.html but are added here to highlight the strings we're looking for
General approach: look for the consecutive strings a) ICON=" b) [^"]* (a string that contains no double quotes) and c) "
One sed approach:
$ sed 's/ICON="[^"]*"//g' bm.html
<!DOCTYPE html>
... some other stuff ...
... some other stuff ...
<DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
<DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
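The output above leaves a stray space before the closing >. A minimal sketch of one way to remove it too, assuming the attribute is always preceded by whitespace as in the sample; the variable names line and cleaned and the shortened ICON payload are placeholders for illustration only:

```shell
# Sample line modeled on bm.html (the ICON payload "data" is a placeholder)
line='<DT><A HREF="https://www.inter.net/search/results/content/?abc" ICON="data">some description</A>'

# [[:space:]]* also consumes the blank(s) in front of ICON=
cleaned=$(printf '%s\n' "$line" | sed 's/[[:space:]]*ICON="[^"]*"//g')

printf '%s\n' "$cleaned"
```

Whether the leftover space matters depends on the importing tool; browsers ignore it, so this variant is optional.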
One awk approach:
$ awk '{gsub(/ICON="[^"]+"/,"")}1' bm.html
<!DOCTYPE html>
... some other stuff ...
... some other stuff ...
<DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
<DT><A HREF="https://www.inter.net/search/results/content/?abc" >some description</A>
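The trailing 1 in the awk command is what prints each (modified) line. An awk program that consists only of an action block containing gsub() prints nothing, which is one plausible reason the copie4.html attempt in the question ended up empty. A small sketch with a made-up input line (line, no_print and with_print are illustrative names, not from the original):

```shell
# Hypothetical one-line input; the ICON payload is a placeholder
line='<DT><A HREF="u" ICON="data">text</A>'

# Action block only: gsub() edits $0 but nothing is ever printed
no_print=$(printf '%s\n' "$line" | awk '{gsub(/ICON="[^"]+"/,"")}')

# Trailing 1 is an always-true pattern whose default action is to print $0
with_print=$(printf '%s\n' "$line" | awk '{gsub(/ICON="[^"]+"/,"")}1')

printf 'without 1: [%s]\nwith 1:    [%s]\n' "$no_print" "$with_print"
```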
NOTE: for this particular html file the global options (sed + /g; awk + gsub() as opposed to sub()) are overkill since there's only one match per line; if linefeeds were to be removed (thus leaving a single long line of data) the global options ensure all ICON="..." matches are replaced within a single line
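To illustrate the difference, here is a small sketch using a made-up single line that holds two ICON attributes; the sample data and the variable names first_only and all are assumptions for the demo, not taken from the original file:

```shell
# Hypothetical single line containing two ICON attributes
line='<A HREF="u1" ICON="aaa">one</A><A HREF="u2" ICON="bbb">two</A>'

# Without /g only the first match on the line is removed
first_only=$(printf '%s\n' "$line" | sed 's/ICON="[^"]*"//')

# With /g every match on the line is removed
all=$(printf '%s\n' "$line" | sed 's/ICON="[^"]*"//g')

printf '%s\n%s\n' "$first_only" "$all"
```

In the first result the second ICON="bbb" survives; in the second result both attributes are gone.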