Extracting bookmarks and folder hierarchy from Google Chrome with BeautifulSoup


I have a large collection of bookmarks in Google Chrome: links, sub-folders mixed in between the links, and in some sub-folders even more sub-folders.
I want to extract the URLs together with other information as plain text for further processing.
To do this, I exported all my bookmarks from Chrome's bookmark manager to an HTML file named bookmarks_8_2_21.html.

A representative excerpt of the file, which I'll use below:

<!DOCTYPE NETSCAPE-Bookmark-file-1>
<!-- This is an automatically generated file.
     It will be read and overwritten.
     DO NOT EDIT! -->
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<TITLE>Bookmarks</TITLE>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3 ADD_DATE="1606927410" LAST_MODIFIED="1620226362" PERSONAL_TOOLBAR_FOLDER="true">Bookmarks bar</H3>
    <DL><p>
        <DT><A HREF="javascript:location.href='org-protocol://capture?template=l&url='+encodeURIComponent(location.href)+'&title='+encodeURIComponent(document.title)+'&body='+encodeURIComponent(window.getSelection())" ADD_DATE="1607739285">org-capture-bookmark</A>
        <DT><A HREF="https://www.google.de/" ADD_DATE="1554935207" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACIklEQVQ4jYWSS0iUURTHf/fe8RvHooE2VlT2FNqUGWmNEYUR9lhEEVJhUIsoXOQuap1Rq6KHNQt3LaPAIOxhlNTChUwLMU3NR1CklUzg6xvPd1ro2KhTHjjcA/e8/uf/hzmmqsUiEheRLhHxp/2TiDxQ1aK5+ZmFeSJSrwuYiMRVNZKuMxnFz51zu9T3GX/6iPGmRqS/F5WAUMEawuUVRI5UYjwPEWl2zlUYY8YMgIjUW2vPBkPfSV6uYbKvJ+uW3rZSojfuABAEQdw5d96oajHQqr7P8IUqpL8X43lEjp3EK4mBtfgt75l4+4po7U3cytWZPbcyjUlTidv642ipDu7foX7bh2zgs92jDhHpUlWdbNmuEw15OvqweqE7ZjboCAEFADrSjs1LkRM7NAt3+bWRebfYudFx9XguwFqbwePs9z/mT/6NLdAHMBpex28W0/C1Y1Zy05VFM75nUwiAZVGT/v5sgdcA3UurOPUrxvXOFhJD7fOmdn4LeNc5NbpkfWimv5mWZ8KXFKdfXqInOYBnc6gsPEjZ8mKssbQOtvEkMczYl0oK8z3un4lgppbYkhZS3Fp7bnD0Jxeba+lODmTFviFcxq29NeRHDUEQ1DnnqtNSjohIo3Nutx+keNz9gmf9zfQkB0ChYMkK9q2KcaLwMJFQGFV9Y4w5YIwZzyBBI2lRLcD9PVXN/SdFqlokInUi0iEiE9P+UUTuqurmufl/AKTzsFGmvUNUAAAAAElFTkSuQmCC"></A>
        <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Folder_1</H3>
        <DL><p>
            <DT><A HREF="https://stackoverflow.com/" ADD_DATE="1605695883" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABXklEQVQ4jbWQsUsCYRjGn/fuSu/Sk3ALmlzNtoagKRqSaHMKGkKhEOV0KWispSXPQaglAnNobOgfaCyIcgicmxO9zFPv/N5WwTs5gt7x+5739/2eDwgw/bK67HcnBQG4Ag3L0LJ/BoBFDuDzTiGUCAywDC3bNbRtANCrwxaBziRZanAGcjADwR8AX1uGesEZyFGzXwO43VsKn07GaJa5lY/GMefUAYooEvaELDnCEW9M2I1V7GdPg04hlLAM7dYqqut67ftLNwdpMB5dgRfXdVMgHIFpx9egfbwYk0eDA2LKAWJMkK6cUOhOGdkpZmoQiy29OmwFq1AKb5CgQyakAXqQJKpELn/eJzPK1JKhPhHjk4EmMzUVmU/coVLkeXff672pk155YXUsxikCJQFeYVCSgCiAV920N311b+r37FslH413S+qaV86rggfIBbG38RRAN+2ZHzsTMKvGv80vvziHGAusG84AAAAASUVORK5CYII=">Stack Overflow - Where Developers Learn, Share, &amp; Build Careers</A>
            <DT><A HREF="https://stackexchange.com/" ADD_DATE="1605695914" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABIUlEQVQ4ja2SvUoDQRSFvztZDSIKWwbxB2zs7eyCWGljYWUlCKKm0kfwEQRR8A0kaGEfW+2VgD+JGxMlELOsWxhJdizMrmtM1hVyYGDm3Pm4c5grtJXLaaM0UV8HTKJVH7fM43RamgCG75Yn7bOkUot/wADkU7V5YAVA+eaAyEIc+LXRpPbRitUolsTf7F64Ouqi9dbi0fGC89WqKRCK8B+4rwoirB3d/YpQqDa4r753BUv7s9ERouC+6jvC3vmTSqixQsXmoWxHQhrcYUOnbk623SCCiCTjwO2uQw7JQYCEb47OLOWLFXsaeAGev5Z2QEYIjTxwKyI7Vnbj8keEXppaPgj/zmbxdOswXI81SICjIdMJ0/G0nvKUN2dlM9fdap8MMGR5HOUBZgAAAABJRU5ErkJggg==">Hot Questions - Stack Exchange</A>
            <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Subfolder</H3>
            <DL><p>
                <DT><A HREF="https://meta.stackexchange.com/" ADD_DATE="1605695986" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAAm0lEQVQ4je2SsQ3CMBBF30XMglKyBB7DEiNATSpnADYAeQtgEcIyR2EiHONgU9DxSv/736fTFyKMv+1BHEW0u9i2B2imZnZlM4C45zyL+DFNz/HaUhzQN+nAJ3NOfwv4ln/ALwLGgsyR6lGRtBsLYvxQrLOg28kGoSDalZcOn51tewhBFRg/KIAim6tdnmKt+og5czXr4301pz0AqgIzDZOACvcAAAAASUVORK5CYII=">Meta Stack Exchange</A>
                <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Another Subfolder</H3>
                <DL><p>
                    <DT><A HREF="https://en.wikipedia.org/wiki/Main_Page" ADD_DATE="1605696025" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Wikipedia, the free encyclopedia</A>
                </DL><p>
            </DL><p>
            <DT><A HREF="https://www.wikipedia.org/" ADD_DATE="1605696017" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Wikipedia</A>
            <DT><A HREF="https://en.wikipedia.org/wiki/Beautiful_Soup_(HTML_parser)" ADD_DATE="1605696102" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABO0lEQVQ4jaWTMaoCMRCG/wnvDtELmHaxdAmIXcheZA9hYeMNxNZqsc81lu0X+2VLTzBj8V5C8uQ9UAcG5k+YP5kvhPATzCx4IZRSBAD0TnNuQu82J5NPmgFADcMAay2UUjifzwAA733S8zzDWgtrLeZ5xvV6xXK5hPcet9vte/5pmoSIJIQgURtj5HQ6CTOLc06maRJmFmaWuq5TjVg454qNrutEay0hBDkej8V6NC4M+r4XANL3fdo0xogxJul4UK4TxPV6Decc9vt9ArTb7XC/35MehgFVVZUUc7cQghCRjOOYTtNaS9d1wszStm3BgpnlKzfz3mO1WuFyuWCz2aBpGlhrcTgcsN1uAQCLxeLvG0RIRJRmjS9U13XB5wlinlrrgnTbtk/w/jWIDPL8PXvMzz9TzuLVZgB4AExRsO8ga8hoAAAAAElFTkSuQmCC">Beautiful Soup (HTML parser) - Wikipedia</A>
        </DL><p>
        <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">Folder_2</H3>
        <DL><p>
            <DT><A HREF="https://www.reddit.com/" ADD_DATE="1605696212" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAACdUlEQVQ4jWXTTYjVdRTG8c/5/e+945hNMhLEgFRUA1lgNYHojJjpDXduQkSqjRiKJNSu2rRJV0VQi14IKohCW8UEIRJZ05uUGESrBBOsJocayRznvvxPizup0dmcxeF5zoHzfAOSCBKybRP2SpO4SUjMSjOKV+Ooz67VhKXKdUZc75CwWzGk/neAQBXUuSi9oevp+NRFiCRMGDHqiIa2rkSNgqu3Xc5aUQwLfR9bsCO+8FcjyLzBIQ1tPR2hIVWD7ZEeforMsPLGyh331g7u6rr4xzbDnseBkltNqezW0xfRIIqqpDrSytHQGg7Tb6XZM6mzWNDQUwuP5xbrI9veU+xU61tUBHpCB+vWs3qcd9+mhYbUxDK1UKm905AmpdQTbh2n2wnj97F5F7fcRd1j7WaOv89Pp0JzKP36c2hKTMmHymJulHlgQ51/X8i8MJdZ9/N/1e9lzp3LnD+fuW+izo0y26VTXPljDvrIKl5+gn330+/R67B3gjefYdUYzSFKdUVWZD1reaRvv0rffJT6fe6epP0YVYNGi8nt3LxmYHjyWDp1Ii2Xsj5bpM+VCEX6+kNZVTy4i3s2Mf878+fZsJ32o7JqcGJ6EK8SIcxUz91mVp2PaEZx+gcREcYn0ty5cPp7fjvD8HVpZFQceSEcfjG1hDo7evYH5BavGLJfR8dlDXeuLR7YkcZuH8T4l9Ph+Af8eLLW0tfS1PVSHPPkIMqTVhh2WNM2/UgLWesITQNWulJLGo6iytA1rdjpqEtXYVpjhTEHhT2qsoygXiKqlAFVvXoBr/nTs/GdS5Y4+y/OW01hD6awesn/rDCj7/X4xJfXav4BhnocQyGrEocAAAAASUVORK5CYII=">reddit: the front page of the internet</A>
            <DT><A HREF="https://www.youtube.com/" ADD_DATE="1574152707" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAABx0lEQVQ4jZ2TQWtTQRDHfzO7yUsNKSG0heJJ0YKnCvVSkHrV7+BBeu7Vk9+lH8CLN6EXk4Lo1V48lGClFBELGmmSmr73djy8fS8vll4c+LO7s7OzM///LhQmBmrg+uDtZrgIBQQAKyf/YQbiBexFt9t9niSP7ji3umrW7og0UfWL0ZaOzGYjmBxn2c/BdPpJxuNz+vDwFxxnqmaqZs4V8L5AuS6haqmqXcDpO3jCEN4YmEFmkMcxrSCSGqSh8GcGIY52Am91Ce5HJ5EYBRwbG45222HmUHVS+DX2DhASuKu3oAeolGSqCiDs7gqHh8L2thCCxOQSbxFAO9BToClzNQSJokwmsLUFgwHs78P6eklnlQho6c0axUKbTVhZgUYjHpeqCwE8kMWFxYNF9uVlGA5hbw8ODuqJzajqnPENPtfYDxU2N4OtrRVzkVDfC0V8/h2G/g98AR7ESkJF8tFRcYdzkOf15iRW6y/hlD48voCz+BbmELFrvhrG8OMjPBOAV3D7qXM791R73Var00qSJRoNTyifB5Dn+eVsNv19dTX+mmWj93n+4SWciM1LKkmqy7RoImFBqPqPfH39K7t/4A18f74nAH8Bjm35s3ZkOjEAAAAASUVORK5CYII=">YouTube</A>
            <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">stuff</H3>
            <DL><p>
                <DT><A HREF="https://www.pgadmin.org/" ADD_DATE="1566393697" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAC4ElEQVQ4jU2TS2hdZRSFv7XPf26eNbV52jRp0tYYiVVDfUBBlEpHTuxIOhEzMHUi6EjswIEOBQV1YsFWyMCJCI6sGhBRJ1JDrVpMGnuJSXprTFqpjXmc8//bwb0R93zvvdb+9tL4qfeetBDOxFgMmeMOBiAhAHdHEu44gCC5kIXKQiz8VJA4CwwGKbowASbh7sSUGs2OEA44mHtyT3FY5ueCuw9sb27Gje3SKsEoo1PGqKY8o6UpR4AkortM8vrATMQYHfpDEcs40LPbHhrZq9mltdS5q8UO7u30mSvXdKlawx2KoqS1uclvb24Jl+fBMJOFzFJY3yjsscNDevXkExRlVB6yhlU4/cF5vvtlgVeeeZzDw336dfFPr9Zu6shIP1NfzvDZ97MKkrwsk2JKrG9u89qH0/TtaefFp4+ycnOd9186wehgD9dv/M3YUK+OH7kbgG9+rnoRnYAnudcPNz3zG1PTMwx072a47052tTUxOtjD5YU/mHjzY279s8WZl09wdGw/G1uFJAhIyOqHuv9AH2P7exkd6Karow1PTh0lxOjEmADIzGhQJngDG8DIvi59/fYkmRkA585foFq7wdhQL5++8Szg9Hd14O6Y1fEGAWVMxJT49qcFv7K8qkP9nVycrzG3tMrzb33C5FOP8PA9+7hUvc7iyl88eu8gRRkbChwPmSkz4/bGFqfPfk7XHW2UMSLBsfGDzF9b46OvfqS7o43XnzvukjS7tOZ5CIQQTEtrt/hhbtnnllfp6Wj31uZcABtbBSePPcD4oX7+V5r6YoYLc0tqb8nR+AvvFEWZzJMbUqoEM5CDK7nT1lzhvqFeDty1x4sy6vLvK35xvkYlD5jhenDy3WQiIdlOeMC1s66Mic3tkpgSIK+EjNbmHJyUIAtm2SKZBr0sI40k7igAkYeMpjyA5Dg4TkzJLcszeVo20ITQVRcSpP+MSvXsAcnrPxBTwpMnSUhW9eQT/wJc4GRalsmdmQAAAABJRU5ErkJggg==">pgAdmin - PostgreSQL Tools</A>
            </DL><p>
            <DT><A HREF="https://www.gnu.org/software/emacs/" ADD_DATE="1605696341" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs - GNU Project</A>
            <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">other stuff</H3>
            <DL><p>
                <DT><H3 ADD_DATE="1606928255" LAST_MODIFIED="1606928255">emacs</H3>
                <DL><p>
                    <DT><A HREF="https://www.gnu.org/software/emacs/download.html" ADD_DATE="1605696357" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs download - GNU Project</A>
                    <DT><A HREF="https://www.gnu.org/software/emacs/documentation.html" ADD_DATE="1605696393" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAADN0lEQVQ4jW2TXWibZRiG7/f7S76k+WKS/iTNTLRzSSSr60TmWEVQ8GdSJjpxIIh4oB4peqQFj/TAE0EEEakgKII60ENxYwpSmTQd27Ru0OiSNj9Nk7Rp/r6kyfe+7+PB1qFj99HzwH0/XAfPzXAbzU29HfONed/SXeaklJLtDvq5Tm3j4zPrC5Vbvey/y8np+dTI6PjnoVj0kP/OsE/RVEBI8N0hGmul1na5cmGrWnj57LVPi3sZdW949vC7J8IHEqfjR6bTZtByMcZAgjAaNJFMhWBafrdrZHSKhuK5qJleztZ/L9w8cCo1nxhPHjgduS8RBWOAIJCQGAuZCAbcuP9IBI8cvxuXzm/A4wlZw17v0ZAS+zbfvGhrAGCGxxfCBxNREBAJe3Bsdh98PgM+y4DHq2OwK1DMteDYnMAlYlPpeLu9vQDgae2J+JsRKxqeYSAQJ/h9LqTSoyjkm/j6sz/Q2LRBnOg6FQFCQhGEgH/swccnXvSqDySPvxObST/GcB29XrGxsd7GwZlx6LqCarFLQ5uDhAQEAYIghYRLNTyVbrGkqIa+X1FUgBOIS4BLbFVsLJ5Zo2jMwuvvHWNPnkowt65CcgnJJcAJpjLCDOae0SAJ4BIkCJbPwNzzSexPhSC4ZFcubGLp7Brl/txCr+vcIJDQuAJFMIAAzenvrvH+AKqi4amTCdxzbwjd1gBffJChnWoP8kYIgqBwQBUqNK6g37cxFPaK0rfrnzQKlTYcovXVHSICpCRE4xaCIRM6ADgE1WHQuArdUaFxFbVmoe60Ol8yAHjh6PuLyemjsyQlJiZHcPihKLsrFYThVgECVE2B129g8+82vp+/TLIjsFj66ccfsh/OaQBgd2qvlXJXz0X3JcOVfBuVf5rkcWuIJYKIxCwEJ7ysWx1Q9lwV6BBWaplStZF99eYnrtYz9UlvssSH/GHLG/AamopAyINmvofC8g7yv22jcrEFpyHxVz1TXu/mXvl145vL/+tCtrZ0JeCJ/2K3dg6xIQsqXZeOvgKdq2AOodzKd1e2lzLl7uozPxe+Wr5tG/c0G3spfYfb/4apuMNEQg7EsNwY1D46X/zu2q3efwF9w4d36At8owAAAABJRU5ErkJggg==">GNU Emacs documentation - GNU Project</A>
                </DL><p>
            </DL><p>
            <DT><A HREF="https://orgmode.org/" ADD_DATE="1605696413" ICON="data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABAAAAAQCAYAAAAf8/9hAAAClElEQVQ4jZ1TzU8TcRSc/f36Ddt2aWkLKbi0FCSUCCSiRUTExCBGYyJKotGLHv03jJ41Xjx48aQXPRA1xujBGJRoJCIYAhQo1NKvbZfdbbct2/WkIQIhOqdJ3ps5vJlHsT9qAOgAqrsN6V6qjqGhcQeloeELp29bOJe0sbI6/y8Gnu5TA89bONvNoxWh3dQaOOkd6DMvTn6JApC2LzLbOHsu5D7R6HZ3NVD9hnJxPFh2e0BKKlS1jHq+CbGZH6vxleV7qiDFqtDLm8mNZQYAejxs5Erk0OP2poYgCjLKiTXMRYah9EUAWYYsiLCxNtjsLKpaFVJOhMPNYeHz1xgJ2s2t1473vuoPtwWh69jMZaGAAZn6iFJsDQylcHpcMFktIJTAaDGBUAoxm0d6PfnCcL63/c7hUIs9KxehCBlUJAnR5hAyfh7leALFLQ2s046ipCATi7/TCSNpqtaQTSZL0+9f3zX467j+LR1Q8wK2ZBESMUI62AlHOAxhaRkr07PfNtP5h7KYi868/fDy72sbNB1qQUijJArQ1BIS/laAUmiyAr1aRUGS30w+m3iwV9wklU7PavkMNFlG1N0IJRAC4/XBQBgYTCawdY4RH8/zexlQO0NFr4u9muI8SAfaYA53wcAAhBAYjEYQSus9LU3XWadzOb4QndthsCjKC2WrNWs6Mzpay/NgCANCKXRdh8lqgdlmRbVSsbJu7pLFap//ubj0fUcTo8nclNXnzXsDB0ac9S4YTEYwhKAgSqioKkpFFZYaG2rr7GOlMp4K6+uZ3ZqIjsGewc4jA5eNFqNDSmXnYrG1T0o8keK8PgfX6DrmavbfknJifuL+o26GYXZ9rv3ARsbOPmmL9Jz6H/Ef8Dzv/M1/AdxXB/z0rsGnAAAAAElFTkSuQmCC">Org mode for Emacs</A>
        </DL><p>
</DL><p>

From this file, I want to extract the following information:

  1. URL
  2. Description
  3. Add Date
  4. Folder hierarchy / path of the URL

I got the first three requirements working with BeautifulSoup, but I can't get the fourth to work, so I'll explain that point in more detail.

Let's assume the following folder hierarchy:

Bookmarks 
\_Bookmarks bar
  \_Folder_1
    \_Subfolder
      \_Another Subfolder
  \_Folder_2
    \_stuff
    \_other stuff
      \_emacs

Ideally, I'd like the following output for a URL in 'Another Subfolder':

https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
Bookmarks bar/Folder_1/Subfolder/Another Subfolder

However, this output would already be very helpful:

https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
Another Subfolder

My code so far:

from bs4 import BeautifulSoup

def read_in_file(filename):
    # The context manager closes the file even if parsing raises an error.
    with open(filename, 'r') as f:
        return BeautifulSoup(f.read(), 'html.parser')

soup = read_in_file('bookmarks_8_2_21.html')
for line in soup.find_all('a'):
    print(line.get('href'))     # 1) URL:         works 
    print(line.get_text())      # 2) Description: works 
    print(line.get('add_date')) # 3) Add Date:    works

    folder = soup.find('h3')  # 4) Folder hierarchy/path: not working (always the first <h3>)
    print(folder.contents)    # only prints ['Bookmarks bar']

    print()

Output of an entry so far:

https://en.wikipedia.org/wiki/Main_Page
Wikipedia, the free encyclopedia
1605696025
['Bookmarks bar']

I also experimented with siblings and found out how to print the folder hierarchy, but I couldn't combine it with the other code:

Code snippet:

for dir in soup.find_all('h3', recursive=True):
    print(dir.text)

Output:

Bookmarks bar
Folder_1
Subfolder
Another Subfolder
Folder_2
stuff
other stuff
emacs

Thank you for any help and suggestions!


Solution

  • The problem is most likely in how BeautifulSoup reads the exported file, specifically the description term (<DT>) elements. These tags are never closed in the export, so the parser has to guess where each one ends and may close it in an unexpected place.

    So I closed each <DT> tag on the same line where it starts; after that, extracting the data is straightforward.

    from bs4 import BeautifulSoup

    # The 'lxml' parser requires the lxml package to be installed.
    with open('bookmarks.html') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    folder_name = ''
    for dt in soup.find_all('dt'):
        n = dt.find_next()  # first tag after <dt>; in this file an <h3> (folder) or an <a> (link)
        if n.name == 'h3':
            # Remember the most recent folder heading for the links that follow.
            folder_name = n.text
            continue
        print(f'url = {n.get("href")}')
        print(f'website name = {n.text}')
        print(f'add date = {n.get("add_date")}')
        print(f'folder name = {folder_name}')
        print()
    

    Here is a small section of the output:

    url = https://stackoverflow.com/
    website name = Stack Overflow - Where Developers Learn, Share, & Build Careers
    add date = 1605695883
    folder name = Folder_1
    
    url = https://stackexchange.com/
    website name = Hot Questions - Stack Exchange
    add date = 1605695914
    folder name = Folder_1
    
    url = https://meta.stackexchange.com/
    website name = Meta Stack Exchange
    add date = 1605695986
    folder name = Subfolder
    
    url = https://en.wikipedia.org/wiki/Main_Page
    website name = Wikipedia, the free encyclopedia
    add date = 1605696025
    folder name = Another Subfolder
    
    url = https://www.wikipedia.org/
    website name = Wikipedia
    add date = 1605696017
    folder name = Another Subfolder
    

    Here I have assumed that every link belongs to the most recent folder heading that precedes it. That is not always correct: in the output above, https://www.wikipedia.org/ is reported under 'Another Subfolder' although in the file it sits directly in Folder_1, after the sub-folders have been closed.

    If you want more accurate results, you should also close the p tags, since they are left open as well and the parser may end them anywhere.

    The way to do that is to find the dl tags and traverse each one separately, so you can determine precisely which dt tag belongs to which folder (dl element).

    This is a very specific type of problem, because not everyone organizes bookmarks the same way. Note also that the HTML varies with the folder organization: for example, whether links or sub-folders come first changes the structure of the exported file.
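
The dl-traversal idea described above could be sketched as follows. Note that this swaps in a different technique from the answer's code: instead of BeautifulSoup it uses the standard library's html.parser, which reports tags in document order without trying to repair the unclosed DT and p tags. The sketch keeps a stack with one entry per open DL, so each link can be given its full folder path. The names BookmarkParser and SAMPLE are made up for illustration, and SAMPLE is a shortened excerpt of the file from the question (icons omitted, one title abbreviated).

```python
from html.parser import HTMLParser


class BookmarkParser(HTMLParser):
    """Collect (url, title, add_date, folder_path) tuples (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.folders = []    # one entry per open <DL>; None for the unnamed root list
        self.pending = None  # text of the latest <H3>, waiting for its <DL>
        self.capture = None  # 'h3' or 'a' while collecting text, else None
        self.text = []
        self.link = None
        self.bookmarks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == 'dl':
            self.folders.append(self.pending)
            self.pending = None
        elif tag in ('h3', 'a'):
            self.capture, self.text = tag, []
            if tag == 'a':
                self.link = dict(attrs)

    def handle_data(self, data):
        if self.capture:
            self.text.append(data)

    def handle_endtag(self, tag):
        if tag == 'dl' and self.folders:
            self.folders.pop()  # leaving a folder
        elif tag == self.capture:
            text = ''.join(self.text).strip()
            if tag == 'h3':
                self.pending = text
            else:
                path = '/'.join(f for f in self.folders if f)
                self.bookmarks.append(
                    (self.link.get('href'), text, self.link.get('add_date'), path))
            self.capture = None


# Shortened excerpt of the exported file, for illustration only.
SAMPLE = """<!DOCTYPE NETSCAPE-Bookmark-file-1>
<H1>Bookmarks</H1>
<DL><p>
    <DT><H3 ADD_DATE="1606927410">Bookmarks bar</H3>
    <DL><p>
        <DT><H3 ADD_DATE="1606928255">Folder_1</H3>
        <DL><p>
            <DT><A HREF="https://stackoverflow.com/" ADD_DATE="1605695883">Stack Overflow</A>
            <DT><H3 ADD_DATE="1606928255">Subfolder</H3>
            <DL><p>
                <DT><A HREF="https://meta.stackexchange.com/" ADD_DATE="1605695986">Meta Stack Exchange</A>
            </DL><p>
            <DT><A HREF="https://www.wikipedia.org/" ADD_DATE="1605696017">Wikipedia</A>
        </DL><p>
    </DL><p>
</DL><p>
"""

parser = BookmarkParser()
parser.feed(SAMPLE)
parser.close()

for url, title, add_date, path in parser.bookmarks:
    print(url, title, add_date, path, sep='\n', end='\n\n')
```

With this approach the https://www.wikipedia.org/ link, which follows the closed Subfolder list, is reported under Bookmarks bar/Folder_1 rather than under the last folder heading seen. To run it on a real export, the file contents would be passed to parser.feed() in place of SAMPLE.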