I have a large collection of bookmarks in Google Chrome: links, sub-folders mixed in between the links, and in some sub-folders even more sub-folders.
Now I want to extract the URLs, together with some other information, as plain text for further processing.
To do this, I exported all my bookmarks from the Google Chrome bookmark manager to an HTML file named bookmarks_8_2_21.html.
A representative excerpt of the file, which I'll use in the following, is:
<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
It will be read and overwritten.
DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
<DT><H3 ADD_DATE="1606927410" LAST_MODIFIED="1620226362" PERSONAL_TOOLBAR_FOLDER="true">Bookmarks bar</H3>
<DL><p>
<DT><A HREF="javascript:location.href='org-protocol://capture?template=l&url='+encodeURIComponent(location.href)+'&title='+encodeURIComponent(document.title)+'&body='+encodeURIComponent(window.getSelection())" ADD_DATE="1607739285">org-capture-bookmark</A>
<DT><A HREF="https://www.google.de/" ADD_DATE="1554935207" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACIklEQVQ4jYWSS0iUURTHf/fe8RvHooE2VlT2FNqUGWmNEYUR9lhEEVJhUIsoXOQuap1Rq6KHNQt3LaPAIOxhlNTChUwLMU3NR1CklUzg6xvPd1ro2KhTHjjcA/e8/uf/hzmmqsUiEheRLhHxp/2TiDxQ1aK5+ZmFeSJSrwuYiMRVNZKuMxnFz51zu9T3GX/6iPGmRqS/F5WAUMEawuUVRI5UYjwPEWl2zlUYY8YMgIjUW2vPBkPfSV6uYbKvJ+uW3rZSojfuABAEQdw5d96oajHQqr7P8IUqpL8X43lEjp3EK4mBtfgt75l4+4po7U3cytWZPbcyjUlTidv642ipDu7foX7bh2zgs92jDhHpUlWdbNmuEw15OvqweqE7ZjboCAEFADrSjs1LkRM7NAt3+bWRebfYudFx9XguwFqbwePs9z/mT/6NLdAHMBpex28W0/C1Y1Zy05VFM75nUwiAZVGT/v5sgdcA3UurOPUrxvXOFhJD7fOmdn4LeNc5NbpkfWimv5mWZ8KXFKdfXqInOYBnc6gsPEjZ8mKssbQOtvEkMczYl0oK8z3un4lgppbYkhZS3Fp7bnD0Jxeba+lODmTFviFcxq29NeRHDUEQ1DnnqtNSjohIo3Nutx+keNz9gmf9zfQkB0ChYMkK9q2KcaLwMJFQGFV9Y4w5YIwZzyBBI2lRLcD9PVXN/SdFqlokInUi0iEiE9P+UUTuqurmufl/AKTzsFGmvUNUAAAAAElFTkSuQmCC"></A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Folder_1</H3>
<DL><p>
<DT><A HREF="https://stackoverflow.com/" ADD_DATE="1605695883" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABXklEQVQ4jbWQsUsCYRjGn/fuSu/Sk3ALmlzNtoagKRqSaHMKGkKhEOV0KWispSXPQaglAnNobOgfaCyIcgicmxO9zFPv/N5WwTs5gt7x+5739/2eDwgw/bK67HcnBQG4Ag3L0LJ/BoBFDuDzTiGUCAywDC3bNbRtANCrwxaBziRZanAGcjADwR8AX1uGesEZyFGzXwO43VsKn07GaJa5lY/GMefUAYooEvaELDnCEW9M2I1V7GdPg04hlLAM7dYqqut67ftLNwdpMB5dgRfXdVMgHIFpx9egfbwYk0eDA2LKAWJMkK6cUOhOGdkpZmoQiy29OmwFq1AKb5CgQyakAXqQJKpELn/eJzPK1JKhPhHjk4EmMzUVmU/coVLkeXff672pk155YXUsxikCJQFeYVCSgCiAV920N311b+r37FslH413S+qaV86rggfIBbG38RRAN+2ZHzsTMKvGv80vvziHGAusG84AAAAASUVORK5CYII=">Stack Overflow - Where Developers Learn, Share, & Build Careers</A>
<DT><A HREF="https://stackexchange.com/" ADD_DATE="1605695914" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABIUlEQVQ4ja2SvUoDQRSFvztZDSIKWwbxB2zs7eyCWGljYWUlCKKm0kfwEQRR8A0kaGEfW+2VgD+JGxMlELOsWxhJdizMrmtM1hVyYGDm3Pm4c5grtJXLaaM0UV8HTKJVH7fM43RamgCG75Yn7bOkUot/wADkU7V5YAVA+eaAyEIc+LXRpPbRitUolsTf7F64Ouqi9dbi0fGC89WqKRCK8B+4rwoirB3d/YpQqDa4r753BUv7s9ERouC+6jvC3vmTSqixQsXmoWxHQhrcYUOnbk623SCCiCTjwO2uQw7JQYCEb47OLOWLFXsaeAGev5Z2QEYIjTxwKyI7Vnbj8keEXppaPgj/zmbxdOswXI81SICjIdMJ0/G0nvKUN2dlM9fdap8MMGR5HOUBZgAAAABJRU5ErkJggg==">Hot Questions - Stack Exchange</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Subfolder</H3>
<DL><p>
<DT><A HREF="https://meta.stackexchange.com/" ADD_DATE="1605695986" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAm0lEQVQ4je2SsQ3CMBBF30XMglKyBB7DEiNATSpnADYAeQtgEcIyR2EiHONgU9DxSv/736fTFyKMv+1BHEW0u9i2B2imZnZlM4C45zyL+DFNz/HaUhzQN+nAJ3NOfwv4ln/ALwLGgsyR6lGRtBsLYvxQrLOg28kGoSDalZcOn51tewhBFRg/KIAim6tdnmKt+og5czXr4301pz0AqgIzDZOACvcAAAAASUVORK5CYII=">Meta Stack Exchange</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Another Subfolder</H3>
<DL><p>
<DT><A HREF="https://en.wikipedia.org/wiki/Main_Page" ADD_DATE="1605696025" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Wikipedia, the free encyclopedia</A>
</DL><p>
</DL><p>
<DT><A HREF="https://www.wikipedia.org/" ADD_DATE="1605696017" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Wikipedia</A>
<DT><A HREF="https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)" ADD_DATE="1605696102" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Beautiful Soup (HTML parser) - Wikipedia</A>
</DL><p>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Folder_2</H3>
<DL><p>
<DT><A HREF="https://www.reddit.com/" ADD_DATE="1605696212" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACdUlEQVQ4jWXTTYjVdRTG8c/5/e+945hNMhLEgFRUA1lgNYHojJjpDXduQkSqjRiKJNSu2rRJV0VQi14IKohCW8UEIRJZ05uUGESrBBOsJocayRznvvxPizup0dmcxeF5zoHzfAOSCBKybRP2SpO4SUjMSjOKV+Ooz67VhKXKdUZc75CwWzGk/neAQBXUuSi9oevp+NRFiCRMGDHqiIa2rkSNgqu3Xc5aUQwLfR9bsCO+8FcjyLzBIQ1tPR2hIVWD7ZEeforMsPLGyh331g7u6rr4xzbDnseBkltNqezW0xfRIIqqpDrSytHQGg7Tb6XZM6mzWNDQUwuP5xbrI9veU+xU61tUBHpCB+vWs3qcd9+mhYbUxDK1UKm905AmpdQTbh2n2wnj97F5F7fcRd1j7WaOv89Pp0JzKP36c2hKTMmHymJulHlgQ51/X8i8MJdZ9/N/1e9lzp3LnD+fuW+izo0y26VTXPljDvrIKl5+gn330+/R67B3gjefYdUYzSFKdUVWZD1reaRvv0rffJT6fe6epP0YVYNGi8nt3LxmYHjyWDp1Ii2Xsj5bpM+VCEX6+kNZVTy4i3s2Mf878+fZsJ32o7JqcGJ6EK8SIcxUz91mVp2PaEZx+gcREcYn0ty5cPp7fjvD8HVpZFQceSEcfjG1hDo7evYH5BavGLJfR8dlDXeuLR7YkcZuH8T4l9Ph+Af8eLLW0tfS1PVSHPPkIMqTVhh2WNM2/UgLWesITQNWulJLGo6iytA1rdjpqEtXYVpjhTEHhT2qsoygXiKqlAFVvXoBr/nTs/GdS5Y4+y/OW01hD6awesn/rDCj7/X4xJfXav4BhnocQyGrEocAAAAASUVORK5CYII=">reddit: the front page of the internet</A>
<DT><A HREF="https://www.youtube.com/" ADD_DATE="1574152707" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABx0lEQVQ4jZ2TQWtTQRDHfzO7yUsNKSG0heJJ0YKnCvVSkHrV7+BBeu7Vk9+lH8CLN6EXk4Lo1V48lGClFBELGmmSmr73djy8fS8vll4c+LO7s7OzM///LhQmBmrg+uDtZrgIBQQAKyf/YQbiBexFt9t9niSP7ji3umrW7og0UfWL0ZaOzGYjmBxn2c/BdPpJxuNz+vDwFxxnqmaqZs4V8L5AuS6haqmqXcDpO3jCEN4YmEFmkMcxrSCSGqSh8GcGIY52Am91Ce5HJ5EYBRwbG45222HmUHVS+DX2DhASuKu3oAeolGSqCiDs7gqHh8L2thCCxOQSbxFAO9BToClzNQSJokwmsLUFgwHs78P6eklnlQho6c0axUKbTVhZgUYjHpeqCwE8kMWFxYNF9uVlGA5hbw8ODuqJzajqnPENPtfYDxU2N4OtrRVzkVDfC0V8/h2G/g98AR7ESkJF8tFRcYdzkOf15iRW6y/hlD48voCz+BbmELFrvhrG8OMjPBOAV3D7qXM791R73Var00qSJRoNTyifB5Dn+eVsNv19dTX+mmWj93n+4SWciM1LKkmqy7RoImFBqPqPfH39K7t/4A18f74nAH8Bjm35s3ZkOjEAAAAASUVORK5CYII=">YouTube</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">stuff</H3>
<DL><p>
<DT><A HREF="https://www.pgadmin.org/" ADD_DATE="1566393697" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAC4ElEQVQ4jU2TS2hdZRSFv7XPf26eNbV52jRp0tYYiVVDfUBBlEpHTuxIOhEzMHUi6EjswIEOBQV1YsFWyMCJCI6sGhBRJ1JDrVpMGnuJSXprTFqpjXmc8//bwb0R93zvvdb+9tL4qfeetBDOxFgMmeMOBiAhAHdHEu44gCC5kIXKQiz8VJA4CwwGKbowASbh7sSUGs2OEA44mHtyT3FY5ueCuw9sb27Gje3SKsEoo1PGqKY8o6UpR4AkortM8vrATMQYHfpDEcs40LPbHhrZq9mltdS5q8UO7u30mSvXdKlawx2KoqS1uclvb24Jl+fBMJOFzFJY3yjsscNDevXkExRlVB6yhlU4/cF5vvtlgVeeeZzDw336dfFPr9Zu6shIP1NfzvDZ97MKkrwsk2JKrG9u89qH0/TtaefFp4+ycnOd9186wehgD9dv/M3YUK+OH7kbgG9+rnoRnYAnudcPNz3zG1PTMwx072a47052tTUxOtjD5YU/mHjzY279s8WZl09wdGw/G1uFJAhIyOqHuv9AH2P7exkd6Karow1PTh0lxOjEmADIzGhQJngDG8DIvi59/fYkmRkA585foFq7wdhQL5++8Szg9Hd14O6Y1fEGAWVMxJT49qcFv7K8qkP9nVycrzG3tMrzb33C5FOP8PA9+7hUvc7iyl88eu8gRRkbChwPmSkz4/bGFqfPfk7XHW2UMSLBsfGDzF9b46OvfqS7o43XnzvukjS7tOZ5CIQQTEtrt/hhbtnnllfp6Wj31uZcABtbBSePPcD4oX7+V5r6YoYLc0tqb8nR+AvvFEWZzJMbUqoEM5CDK7nT1lzhvqFeDty1x4sy6vLvK35xvkYlD5jhenDy3WQiIdlOeMC1s66Mic3tkpgSIK+EjNbmHJyUIAtm2SKZBr0sI40k7igAkYeMpjyA5Dg4TkzJLcszeVo20ITQVRcSpP+MSvXsAcnrPxBTwpMnSUhW9eQT/wJc4GRalsmdmQAAAABJRU5ErkJggg==">pgAdmin - PostgreSQL Tools</A>
</DL><p>
<DT><A HREF="https://www.gnu.org/software/emacs/" ADD_DATE="1605696341" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs - GNU Project</A>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">other stuff</H3>
<DL><p>
<DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">emacs</H3>
<DL><p>
<DT><A HREF="https://www.gnu.org/software/emacs/download.html" ADD_DATE="1605696357" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs download - GNU Project</A>
<DT><A HREF="https://www.gnu.org/software/emacs/documentation.html" ADD_DATE="1605696393" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs documentation - GNU Project</A>
</DL><p>
</DL><p>
<DT><A HREF="https://orgmode.org/" ADD_DATE="1605696413" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAClElEQVQ4jZ1TzU8TcRSc/f36Ddt2aWkLKbi0FCSUCCSiRUTExCBGYyJKotGLHv03jJ41Xjx48aQXPRA1xujBGJRoJCIYAhQo1NKvbZfdbbct2/WkIQIhOqdJ3ps5vJlHsT9qAOgAqrsN6V6qjqGhcQeloeELp29bOJe0sbI6/y8Gnu5TA89bONvNoxWh3dQaOOkd6DMvTn6JApC2LzLbOHsu5D7R6HZ3NVD9hnJxPFh2e0BKKlS1jHq+CbGZH6vxleV7qiDFqtDLm8mNZQYAejxs5Erk0OP2poYgCjLKiTXMRYah9EUAWYYsiLCxNtjsLKpaFVJOhMPNYeHz1xgJ2s2t1473vuoPtwWh69jMZaGAAZn6iFJsDQylcHpcMFktIJTAaDGBUAoxm0d6PfnCcL63/c7hUIs9KxehCBlUJAnR5hAyfh7leALFLQ2s046ipCATi7/TCSNpqtaQTSZL0+9f3zX467j+LR1Q8wK2ZBESMUI62AlHOAxhaRkr07PfNtP5h7KYi868/fDy72sbNB1qQUijJArQ1BIS/laAUmiyAr1aRUGS30w+m3iwV9wklU7PavkMNFlG1N0IJRAC4/XBQBgYTCawdY4RH8/zexlQO0NFr4u9muI8SAfaYA53wcAAhBAYjEYQSus9LU3XWadzOb4QndthsCjKC2WrNWs6Mzpay/NgCANCKXRdh8lqgdlmRbVSsbJu7pLFap//ubj0fUcTo8nclNXnzXsDB0ac9S4YTEYwhKAgSqioKkpFFZYaG2rr7GOlMp4K6+uZ3ZqIjsGewc4jA5eNFqNDSmXnYrG1T0o8keK8PgfX6DrmavbfknJifuL+o26GYXZ9rv3ARsbOPmmL9Jz6H/Ef8Dzv/M1/AdxXB/z0rsGnAAAAAElFTkSuQmCC">Org mode for Emacs</A>
</DL><p>
</DL><p>
From this file, I want to extract the following information for each bookmark:
1) URL
2) Description (the link text)
3) Add date
4) Folder hierarchy/path
I got the first three requirements working with BeautifulSoup, but I can't seem to get the fourth working, so I'll try to explain that point a bit further.
Let's assume the following folder hierarchy:
Bookmarks
\_Bookmarks bar
    \_Folder_1
    |   \_Subfolder
    |       \_Another Subfolder
    \_Folder_2
        \_stuff
        \_other stuff
            \_emacs
Ideally, I'd like to have the following exemplary output for a URL in 'Another Subfolder':
https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
Bookmarks bar/Folder_1/Subfolder/Another Subfolder
However, this output would already be very helpful:
https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
Another Subfolder
My code so far is:
from bs4 import BeautifulSoup

def read_in_file(filename):
    f = open(filename, 'r')
    soup = BeautifulSoup(f.read(), 'html.parser')
    f.close()
    return soup

soup = read_in_file('bookmarks_8_2_21.html')
for line in soup.find_all('a'):
    print(line.get('href'))        # 1) URL: works
    print(line.get_text())         # 2) Description: works
    print(line.get('add_date'))    # 3) Add date: works
    dir = soup.find('h3')          # 4) Folder hierarchy/path: not working
    print(dir.contents)            # only prints ['Bookmarks bar']
    print()
Output of an entry so far:
https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
['Bookmarks bar']
I also experimented with siblings and found out how to print out the folder hierarchy, but I couldn't get it to work with the other code:
Code Snippet:
for dir in soup.find_all('h3', recursive=True):
    print(dir.text)
Output:
Bookmarks bar
Folder_1
Subfolder
Another Subfolder
Folder_2
stuff
other stuff
emacs
Thank you for any help and suggestions!
The problem is most likely in how BeautifulSoup reads the exported file, more specifically in how it reads the description term (<DT>) elements. These tags are never closed in the exported file, so the parser cannot know where each one should end and may close it in unexpected places.
So I closed each <DT> tag on the same line it started on; after that, extracting the data should be easy.
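How the tags were closed is not shown here. One possible way, as a minimal sketch, is to append a closing tag to every line that opens an entry. It assumes each <DT> entry sits on its own line, as in the Chrome export above; the helper name close_dt_tags is made up for this illustration.

```python
def close_dt_tags(html_text):
    """Append </DT> to every line that opens a <DT> entry (assumed: one entry per line)."""
    fixed_lines = []
    for line in html_text.splitlines():
        # Tag names in the export are upper case; compare case-insensitively anyway.
        if line.lstrip().upper().startswith("<DT>"):
            line = line.rstrip() + "</DT>"
        fixed_lines.append(line)
    return "\n".join(fixed_lines)


# Tiny hand-written sample in the same shape as the export.
sample = '<DL><p>\n<DT><A HREF="https://www.youtube.com/">YouTube</A>\n</DL><p>'
print(close_dt_tags(sample))
```

The result can be handed to BeautifulSoup in place of the raw file contents. Note that on folder lines the closing tag lands right after the </H3>, so the following <DL> ends up outside the <DT>.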
from bs4 import BeautifulSoup

# Parse the exported bookmarks file with the lxml parser
with open('bookmarks.html', encoding='utf-8') as f:
    soup = BeautifulSoup(f.read(), 'lxml')

folder_name = ''
for dt in soup.find_all('dt'):
    n = dt.find_next()    # first tag following the <dt>: an <h3> or an <a>
    if n.name == 'h3':
        # Folder heading: remember its name for the links that follow
        folder_name = n.text
        continue
    print(f'url = {n.get("href")}')
    print(f'website name = {n.text}')
    print(f'add date = {n.get("add_date")}')
    print(f'folder name = {folder_name}')
    print()
Here is a small section of the output, which I hope is helpful:
url = https://stackoverflow.com/
website name = Stack Overflow - Where Developers Learn, Share, & Build Careers
add date = 1605695883
folder name = Folder_1

url = https://stackexchange.com/
website name = Hot Questions - Stack Exchange
add date = 1605695914
folder name = Folder_1

url = https://meta.stackexchange.com/
website name = Meta Stack Exchange
add date = 1605695986
folder name = Subfolder

url = https://en.wikipedia.org/wiki/Main_Page
website name = Wikipedia, the free encyclopedia
add date = 1605696025
folder name = Another Subfolder

url = https://www.wikipedia.org/
website name = Wikipedia
add date = 1605696017
folder name = Another Subfolder
Here I have assumed that every link belongs to the folder whose name most recently appeared above it. That is not always true (https://www.wikipedia.org/ above is reported under 'Another Subfolder' although it sits directly in Folder_1), for the reason given below.
If you want more accurate results, you should also consider closing the <p> tags, since they too are left open and can end up closed anywhere.
The way forward would be to find the <dl> tags and traverse each of them separately, to determine precisely which <dt> tag belongs to which folder (<dl> element).
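As a rough sketch of that idea, the folder path can be tracked with a stack that grows on every <DL> and shrinks on every </DL>. Note that this swaps BeautifulSoup for the standard library's html.parser.HTMLParser, which reports tags as events and so does not have to guess where the unclosed <DT> and <p> tags end. The class name BookmarkParser, the dictionary keys and the shortened SAMPLE are made up for this illustration, and it assumes every folder's <H3> is directly followed by its <DL>, as in the export above.

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect url, title, add date and folder path for every bookmark link."""

    def __init__(self):
        super().__init__()
        self.folders = []     # names of the folders we are currently inside
        self.pending = None   # text of the last <H3>, waiting for its <DL>
        self.in_h3 = False
        self.link = None      # bookmark currently being read, if any
        self.bookmarks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag and attribute names in lower case.
        attrs = dict(attrs)
        if tag == "h3":
            self.in_h3 = True
            self.pending = ""
        elif tag == "dl":
            # A <DL> opens the folder named by the preceding <H3>;
            # the outermost <DL> has no <H3> and is stored as None.
            self.folders.append(self.pending)
            self.pending = None
        elif tag == "a":
            self.link = {
                "url": attrs.get("href"),
                "title": "",
                "add_date": attrs.get("add_date"),
                "path": "/".join(name for name in self.folders if name),
            }

    def handle_data(self, data):
        if self.in_h3:
            self.pending += data
        elif self.link is not None:
            self.link["title"] += data

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
        elif tag == "dl" and self.folders:
            self.folders.pop()    # leaving the innermost folder
        elif tag == "a" and self.link is not None:
            self.bookmarks.append(self.link)
            self.link = None


# Shortened, hand-written sample in the same shape as the export.
SAMPLE = """<DL><p>
<DT><H3 ADD_DATE="1606927410">Bookmarks bar</H3>
<DL><p>
<DT><H3 ADD_DATE="1606928255">Folder_1</H3>
<DL><p>
<DT><A HREF="https://stackoverflow.com/" ADD_DATE="1605695883">Stack Overflow</A>
<DT><H3 ADD_DATE="1606928255">Subfolder</H3>
<DL><p>
<DT><A HREF="https://meta.stackexchange.com/" ADD_DATE="1605695986">Meta Stack Exchange</A>
</DL><p>
<DT><A HREF="https://www.wikipedia.org/" ADD_DATE="1605696017">Wikipedia</A>
</DL><p>
</DL><p>
</DL><p>
"""

parser = BookmarkParser()
parser.feed(SAMPLE)
parser.close()
for bookmark in parser.bookmarks:
    print(bookmark["url"], bookmark["title"], bookmark["add_date"],
          bookmark["path"], sep="\n", end="\n\n")
```

Because the stack is popped on every </DL>, a link that follows a closed sub-folder (https://www.wikipedia.org/ in the sample) is reported under its real parent folder rather than under the last folder name seen. For the real file, feed the parser the contents of the exported HTML file instead of SAMPLE.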
This is a very specific type of problem, because not everyone organizes their bookmarks in the same manner. Also note that the HTML varies with the organization of the folders: for example, the file looks different depending on whether the links or the sub-folders come first within a folder.