
Find the canonical link in a FILE type file - BeautifulSoup


I have many FILE type files (files saved on the system without any extension). These files contain the HTML content of pages from news websites. I need to find the canonical link (URL) in each of them. I am testing this code on one of the files first:

from bs4 import BeautifulSoup

with open(file, 'r') as f:
    html_text = f.read()
soup = BeautifulSoup(html_text, 'html.parser')

link = soup.find('link', rel = 'canonical')

But find returns None, so I get a NoneType object error as soon as I use the result. I also tried these variations:

# Variation 1
link = soup.find('link', {'rel':'canonical'})
# Variation 2
link = soup.find('link', rel = 'canonical')['href']
# Variation 3
link = soup.find('link', {'rel':'canonical'}).get['href']
# Variation 4
link = soup.find('link', {'rel':'canonical'})['href']

I also tried a soup.find_all variation, but that failed too. (Errors: NoneType object is not subscriptable / NoneType object doesn't have attribute href)

I manually checked my file by opening it in Notepad and found this snippet in it: <link rel=\"canonical\" href=\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\"/>. So the canonical link is definitely present and the result should NOT be None.

This seems like such a simple problem but there seems to be something wrong that I am not able to catch. I went through a lot of questions on StackOverflow that deal with similar issues and tried out their solutions (hence the variations). Any help is appreciated.

Edit - Adding the file content upon request

"<!DOCTYPE html>\n<!--\n ______  __              ______          __                                       __\n/\\__  _\\/\\ \\            /\\__  _\\        /\\ \\__                                   /\\ \\__\n\\/_/\\ \\/\\ \\ \\___      __\\/_/\\ \\/     ___\\ \\ ,_\\    __   _ __   ___     __   _____\\ \\ ,_\\\n   \\ \\ \\ \\ \\  _ `\\  /'__`\\ \\ \\ \\   /' _ `\\ \\ \\/  /'__`\\/\\`'__\\/'___\\ /'__`\\/\\ '__`\\ \\ \\/\n    \\ \\ \\ \\ \\ \\ \\ \\/\\  __/  \\_\\ \\__/\\ \\/\\ \\ \\ \\_/\\  __/\\ \\ \\//\\ \\__//\\  __/\\ \\ \\L\\ \\ \\ \\_\n     \\ \\_\\ \\ \\_\\ \\_\\ \\____\\ /\\_____\\ \\_\\ \\_\\ \\__\\ \\____\\\\ \\_\\\\ \\____\\ \\____\\\\ \\ ,__/\\ \\__\\\n      \\/_/  \\/_/\\/_/\\/____/ \\/_____/\\/_/\\/_/\\/__/\\/____/ \\/_/ \\/____/\\/____/ \\ \\ \\/  \\/__/\n                                                                              \\ \\_\\\n                                                                               \\/_/\n-->\n<html lang=\"en\">\n  <head>\n    <title>The Intercept</title>\n    <meta charset=\"utf-8\">\n    <meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1, maximum-scale=1, user-scalable=no\">\n    <meta name=\"msapplication-TileColor\" content=\"#000000\">\n    <meta name=\"msapplication-TileImage\" content=\"/static/mstile-144x144.png\">\n    <meta name=\"msapplication-config\" content=\"/static/browserconfig.xml\">\n    <meta name=\"theme-color\" content=\"#ffffff\">\n    <meta property=\"og:url\" content=\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\">\n    <link rel=\"apple-touch-icon\" sizes=\"57x57\" href=\"/static/apple-touch-icon-57x57.png\">\n    <link rel=\"apple-touch-icon\" sizes=\"60x60\" href=\"/static/apple-touch-icon-60x60.png\">\n    <link rel=\"apple-touch-icon\" sizes=\"72x72\" href=\"/static/apple-touch-icon-72x72.png\">\n    <link rel=\"apple-touch-icon\" sizes=\"76x76\" 
href=\"/static/apple-touch-icon-76x76.png\">\n    <link rel=\"apple-touch-icon\" sizes=\"114x114\" href=\"/static/apple-touch-icon-114x114.png\">\n    <link rel=\"apple-touch-icon\" sizes=\"120x120\" href=\"/static/apple-touch-icon-120x120.png\">\n    <link rel=\"apple-touch-icon\" sizes=\"144x144\" href=\"/static/apple-touch-icon-144x144.png\">\n    <link rel=\"apple-touch-icon\" sizes=\"152x152\" href=\"/static/apple-touch-icon-152x152.png\">\n    <link rel=\"apple-touch-icon\" sizes=\"180x180\" href=\"/static/apple-touch-icon-180x180.png\">\n    <link rel=\"icon\" type=\"image/png\" href=\"/static/favicon-32x32.png\" sizes=\"32x32\">\n    <link rel=\"icon\" type=\"image/png\" href=\"/static/android-chrome-192x192.png\" sizes=\"192x192\">\n    <link rel=\"icon\" type=\"image/png\" href=\"/static/favicon-96x96.png\" sizes=\"96x96\">\n    <link rel=\"icon\" type=\"image/png\" href=\"/static/favicon-16x16.png\" sizes=\"16x16\">\n    <link rel=\"manifest\" href=\"/static/manifest.json\">\n    <link rel=\"shortcut icon\" href=\"/static/favicon.ico\">\n    <link rel=\"canonical\" href=\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\"/>\n    \n    \n    \n    \n    <!--[if !IE]><!--><link rel=\"stylesheet\" type=\"text/css\" href=\"/assets/app42e762a729b53f810f04.css\"><!--<![endif]-->\n    <!--[if gte IE 9]><link rel=\"stylesheet\" type=\"text/css\" href=\"/assets/app42e762a729b53f810f04.css\"><![endif]-->\n    <!--[if lte IE 8]><link rel=\"stylesheet\" type=\"text/css\" href=\"/assets/ie842e762a729b53f810f04.css\"><![endif]-->\n    \n    <!--[if lte IE 8]>\n    <script>\n      document.createElement('header');\n      document.createElement('nav');\n      document.createElement('section');\n      document.createElement('article');\n      document.createElement('aside');\n      document.createElement('footer');\n      document.createElement('hgroup');\n      document.createElement('picture');\n    </script>\n    <![endif]-->\n    <script 
id=\"ad-block-test\" src=\"/ads.js\" data-blocked=\"true\"></script>\n  </head>\n  <body>\n    <script src=\"/assets/sniffer42e762a729b53f810f04.js\"></script>\n    <div id=\"Root\"><div class=\"InterceptWrapper\" data-reactroot=\"\" data-reactid=\"1\" data-react-checksum=\"1442884202\"><div data-reactid=\"2\"><!-- react-empty: 3 --><!-- react-empty: 4 --></div><div class=\"Header Header--en Header--route-theintercept\" data-reactid=\"5\"><span data-reactid=\"6\"><div class=\"Header-hamburger\" data-reactid=\"7\"><a class=\"Header-hamburger-link\" style=\"color:;\" href=\"/2020/10/27/senator-perdue-ossoff-china/feed/?menu=1\" data-reactid=\"8\"><span class=\"Icon Icon--Menu icon-TI_Menu\" data-reactid=\"9\"></span></a></div><nav class=\"Header-menu\" data-reactid=\"10\"><div class=\"Logo\" data-reactid=\"11\"><div class=\"Logo-bg-block\" data-reactid=\"12\"><div class=\"GridContainer\" data-reactid=\"13\"><div class=\"GridRow\" data-reactid=\"14\"><div class=\"Logo-bg\" data-reactid=\"15\"></div></div></div></div><div class=\"Logo-block\" data-reactid=\"16\"><a href=\"/\" data-reactid=\"17\"><span class=\"Logo-fallback\" style=\"color:#111;\" data-reactid=\"18\"><!-- react-text: 19 -->The<!-- /react-text --><br data-reactid=\"20\"/><!-- react-text: 21 -->Intercept_<!-- /react-text --><span data-reactid=\"22\"><br data-reactid=\"23\"/><!-- react-text: 24 --><!-- /react-text --></span></span><svg class=\"Logo-svg\" height=\"50px\" version=\"1.1\" viewBox=\"0 0 140 50\" width=\"140px\" data-reactid=\"25\"><g data-reactid=\"26\"><path class=\"Logo-path\" d=\"M51.731,30.458c1.246,0,2.264,1.425,2.264,3.206l-4.605,0.56C49.517,31.781,50.28,30.458,51.731,30.458 M40.789,8.601 c1.247,0,2.265,1.424,2.265,3.206l-4.606,0.559C38.575,9.924,39.339,8.601,40.789,8.601 M92.774,30.458 c1.247,0,2.264,1.425,2.264,3.206l-4.605,0.56C90.56,31.781,91.323,30.458,92.774,30.458 M128.295,46.463H140v-2.188h-11.705 V46.463z 
M106.642,31.679c0.279-0.101,0.61-0.178,1.272-0.178c2.544,0,4.275,1.68,4.275,5.42c0,3.104-1.705,5.216-4.173,5.216 c-0.408,0-0.891-0.076-1.374-0.229V31.679z M68.652,33.206h3.18v-4.097c-0.61-0.254-1.017-0.356-1.603-0.356 c-0.992,0-2.188,0.662-3.435,1.654l-0.916,0.713v-2.367h-0.661l-6.285,1.272v0.967l2.112,0.637v10l-2.112,0.763v0.865h9.313v-0.865 l-2.367-0.763v-9.39c0.611-0.229,1.68-0.407,1.934-0.407L68.652,33.206z M80.484,28.753c-3.995,0-8.372,2.494-8.372,8.066 c0,4.631,3.079,6.82,6.412,6.82c1.146,0,2.469-0.153,4.402-0.967l2.341-0.993l-0.33-0.916c-0.586,0.229-1.4,0.382-2.545,0.382 c-3.155,0-5.547-1.883-5.547-6.132c0-2.952,1.705-4.555,3.257-4.555c0.153,0,0.331,0.025,0.484,0.102l1.399,2.646h2.926v-3.613 C83.741,29.16,82.138,28.753,80.484,28.753 M123.792,26.489h-1.171l-5.521,3.18v1.069l1.857,0.178v10.076 c0,1.807,1.425,2.647,3.435,2.647c0.942,0,2.697-0.306,3.868-0.662l2.035-0.636l-0.178-0.941c-0.33,0.076-0.916,0.101-1.246,0.101 c-1.858,0-3.079-0.687-3.079-2.977v-7.455l4.351,0.229v-2.163h-4.351V26.489z M40.128,26.489h-1.171l-5.522,3.18v1.069l1.858,0.178 v10.076c0,1.807,1.425,2.647,3.435,2.647c0.942,0,2.697-0.306,3.868-0.662l2.035-0.636L44.453,41.4 c-0.331,0.076-0.916,0.101-1.247,0.101c-1.857,0-3.078-0.687-3.078-2.977v-7.455l4.351,0.229v-2.163h-4.351V26.489z M58.626,35.293 v-0.407c0-2.189-1.476-6.133-6.437-6.133c-3.766,0-7.558,2.494-7.558,8.066c0,4.555,3.334,6.82,6.972,6.82 c1.069,0,2.621-0.178,4.428-0.967l2.264-0.993l-0.33-0.916c-0.586,0.229-1.4,0.382-2.596,0.382c-2.799,0-5.878-1.628-6.005-5.852 H58.626z M99.67,35.293v-0.407c0-2.189-1.476-6.133-6.438-6.133c-3.766,0-7.557,2.494-7.557,8.066c0,4.555,3.333,6.82,6.972,6.82 c1.068,0,2.62-0.178,4.427-0.967l2.265-0.993l-0.331-0.916c-0.585,0.229-1.4,0.382-2.596,0.382c-2.798,0-5.877-1.628-6.005-5.852 H99.67z M47.685,13.435v-0.407c0-2.188-1.476-6.132-6.438-6.132c-3.766,0-7.557,2.493-7.557,8.066c0,4.555,3.333,6.819,6.972,6.819 c1.069,0,2.621-0.178,4.427-0.967l2.265-0.992l-0.331-0.916c-0.585,0.229-1.399,0.382-2.595,0.382 
c-2.799,0-5.878-1.629-6.005-5.853H47.685z M6.438,25.598v15.725L3.97,42.265v0.992h10.865v-0.992l-2.468-0.942V25.598l2.468-0.942 v-0.992H3.97v0.992L6.438,25.598z M31.781,41.629V33.74c0-2.926-1.094-4.987-4.045-4.987c-1.222,0-2.138,0.382-3.334,0.916 l-2.163,0.942v-1.858h-0.661l-6.285,1.272v0.967l2.112,0.637v10l-2.112,0.763v0.865h8.804v-0.865l-1.858-0.763v-9.924 c0.458-0.127,1.247-0.204,1.833-0.204c1.577,0,2.875,0.865,2.875,3.461v6.667l-1.858,0.763v0.865h8.804v-0.865L31.781,41.629z M106.642,28.753h-0.662l-6.285,1.272v0.967l2.112,0.637v15.598l-2.112,0.763v0.865h9.567V47.99l-2.62-0.763v-3.588h0.992 c5.954,0,9.287-3.614,9.287-8.626c0-4.301-3.002-6.26-5.241-6.26c-0.891,0-2.443,0.535-3.868,1.247l-0.967,0.483h-0.203V28.753z M31.527,19.771v-7.888c0-2.926-1.094-4.987-4.046-4.987c-1.221,0-2.137,0.381-3.333,0.916l-2.163,0.941V0h-0.662l-4.911,1.807H0 v4.911h2.316L3.257,3.69h3.181v15.776L3.97,20.407V21.4h10.61v-0.993l-2.213-0.941V3.69h4.783v16.081l-1.857,0.763V21.4h8.55 v-0.866l-1.858-0.763V9.873c0.458-0.127,1.247-0.229,1.832-0.229c1.578,0,2.875,0.865,2.875,3.46v6.667l-1.857,0.763V21.4h8.804 v-0.866L31.527,19.771z\" fill=\"#111\" data-reactid=\"27\"></path></g></svg></a></div></div><ul class=\"Header-language-list\" data-reactid=\"28\"><li class=\"Header-language-list-item Header-language-list-item--active\" data-reactid=\"29\"><a class=\"Header-language-link\" href=\"/\" data-reactid=\"30\">English</a></li><li class=\"Header-language-list-item\" data-reactid=\"31\"><a class=\"Header-language-link\" href=\"/brasil/\" data-reactid=\"32\">Portugu\u00eas</a></li></ul><div class=\"Header-search-block\" data-reactid=\"33\"><form action=\"/search\" type=\"get\" data-reactid=\"34\"><label class=\"Header-search-label\" for=\"search\" data-reactid=\"35\"><span class=\"Icon Icon--Search icon-TI_Search\" data-reactid=\"36\"></span></label><input id=\"search\" class=\"Header-search-input\" name=\"s\" data-reactid=\"37\"/></form></div><div class=\"Header-menu-mission-block\" 
data-reactid=\"38\"><ul class=\"Header-menu-list Header-menu-list--collection-items\" data-reactid=\"39\"><li class=\"Header-menu-list-item\" data-reactid=\"40\"><a class=\"Header-menu-link\" href=\"/politics/\" data-reactid=\"41\">Politics</a></li><li class=\"Header-menu-list-item\" data-reactid=\"42\"><a class=\"Header-menu-link\" href=\"/justice/\" data-reactid=\"43\">Justice</a></li><li class=\"Header-menu-list-item\" data-reactid=\"44\"><a class=\"Header-menu-link\" href=\"/national-security/\" data-reactid=\"45\">National Security</a></li><li class=\"Header-menu-list-item\" data-reactid=\"46\"><a class=\"Header-menu-link\" href=\"/world/\" data-reactid=\"47\">World</a></li><li class=\"Header-menu-list-item\" data-reactid=\"48\"><a class=\"Header-menu-link\" href=\"/technology/\" data-reactid=\"49\">Technology</a></li><li class=\"Header-menu-list-item\" data-reactid=\"50\"><a class=\"Header-menu-link\" href=\"/environment/\" data-reactid=\"51\">Environment</a></li></ul></div><div class=\"Header-menu-mission-block\" data-reactid=\"52\"><ul class=\"Header-menu-list Header-menu-list--mission-items\" data-reactid=\"53\"><li class=\"Header-menu-list-item\" data-reactid=\"54\"><a class=\"Header-menu-link\" href=\"/special-investigations/\" data-reactid=\"55\">Special Investigations</a></li><li class=\"Header-menu-list-item\" data-reactid=\"56\"><a class=\"Header-menu-link\" href=\"/voices/\" data-reactid=\"57\">Voices</a></li><li class=\"Header-menu-list-item\" data-reactid=\"58\"><a class=\"Header-menu-link\" href=\"/podcasts/\" data-reactid=\"59\">Podcasts</a></li><li class=\"Header-menu-list-item\" data-reactid=\"60\"><a class=\"Header-menu-link\" href=\"/videos/\" data-reactid=\"61\">Videos</a></li><li class=\"Header-menu-list-item\" data-reactid=\"62\"><a class=\"Header-menu-link\" href=\"/documents/\" data-reactid=\"63\">Documents</a></li><li class=\"Header-menu-list-item\" data-reactid=\"64\"><a class=\"Header-menu-link\" 
href=\"https://join.theintercept.com/donate/now?source=web_intercept_20200601_hamburger\" data-reactid=\"65\"><div class=\"Header-menu-list-item-button\" data-reactid=\"66\"><!-- react-text: 67 -->Become A Member<!-- /react-text --><span class=\"Icon Icon--Arrow_02_Right icon-TI_Arrow_02_Right\" data-reactid=\"68\"></span></div></a></li></ul></div><ul class=\"Header-menu-list Header-menu-list--content-items\" data-reactid=\"69\"><li class=\"Header-menu-list-item\" data-reactid=\"70\"><a class=\"Header-menu-link\" href=\"/about/\" data-reactid=\"71\">About</a></li><li class=\"Header-menu-list-item\" data-reactid=\"72\"><a class=\"Header-menu-link\" href=\"/policies/\" data-reactid=\"73\">Editorial Policies</a></li><li class=\"Header-menu-list-item\" data-reactid=\"74\"><a class=\"Header-menu-link\" href=\"/source/\" data-reactid=\"75\">Become a Source</a></li><li class=\"Header-menu-list-item\" data-reactid=\"76\"><a class=\"Header-menu-link\" href=\"/newsletter/?source=web_hamburger\" data-reactid=\"77\">Join Newsletter</a></li></ul><div class=\"Header-footer\" data-reactid=\"78\"><div class=\"Header-social-links\" data-reactid=\"79\"><a class=\"Header-social-link\" target=\"_blank\" data-label=\"facebook\" href=\"https://www.facebook.com/theinterceptflm\" data-reactid=\"80\"><span class=\"Icon Icon--Facebook icon-TI_Facebook\" data-reactid=\"81\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"twitter\" href=\"https://twitter.com/theintercept\" data-reactid=\"82\"><span class=\"Icon Icon--Twitter icon-TI_Twitter\" data-reactid=\"83\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"instagram\" href=\"https://www.instagram.com/theintercept/\" data-reactid=\"84\"><span class=\"Icon Icon--Instagram icon-TI_Instagram\" data-reactid=\"85\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"tumblr\" href=\"https://the-intercept.tumblr.com\" data-reactid=\"86\"><span class=\"Icon Icon--Tumblr 
icon-TI_Tumblr\" data-reactid=\"87\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"snapchat\" href=\"https://www.snapchat.com/add/theintercept\" data-reactid=\"88\"><span class=\"Icon Icon--Snapchat icon-TI_Snapchat\" data-reactid=\"89\"></span></a><a class=\"Header-social-link\" target=\"_blank\" data-label=\"flipboard\" href=\"https://flipboard.com/@TheIntercept\" data-reactid=\"90\"><span class=\"Icon Icon--Flipboard icon-TI_Flipboard\" data-reactid=\"91\"></span></a><a class=\"Header-social-link\" data-label=\"rss\" href=\"/feeds/\" data-reactid=\"92\"><span class=\"Icon Icon--RSS icon-TI_RSS\" data-reactid=\"93\"></span></a></div><img class=\"Header-FLM-svg\" src=\"/static/FLM.svg\" alt=\"First Look Media logo\" data-reactid=\"94\"/><p class=\"Header-TM\" data-reactid=\"95\">The Intercept is a First Look Media Company.</p><cite class=\"Header-copyright\" data-reactid=\"96\"><!-- react-text: 97 -->\u00a9 First Look Media. <!-- /react-text --><span data-reactid=\"98\">All rights reserved</span></cite><ul class=\"Header-footer-list\" data-reactid=\"99\"><li class=\"Header-footer-list-item\" data-reactid=\"100\"><a class=\"Header-footer-link\" data-label=\"terms of use\" href=\"/terms-use/\" data-reactid=\"101\"><span data-reactid=\"102\">Terms of use</span></a></li><li class=\"Header-footer-list-item\" data-reactid=\"103\"><a class=\"Header-footer-link\" data-label=\"privacy policy\" href=\"/privacy-policy/\" data-reactid=\"104\"><span data-reactid=\"105\">Privacy</span></a></li></ul></div></nav></span></div><div class=\"ErrorPage\" data-reactid=\"106\"><div class=\"Logo\" data-reactid=\"107\"><div class=\"Logo-block\" data-reactid=\"108\"><a href=\"/\" data-reactid=\"109\"><span class=\"Logo-fallback\" style=\"color:#fff;\" data-reactid=\"110\"><!-- react-text: 111 -->The<!-- /react-text --><br data-reactid=\"112\"/><!-- react-text: 113 -->Intercept_<!-- /react-text --><span data-reactid=\"114\"><br data-reactid=\"115\"/><!-- 
react-text: 116 --><!-- /react-text --></span></span><svg class=\"Logo-svg\" height=\"50px\" version=\"1.1\" viewBox=\"0 0 140 50\" width=\"140px\" data-reactid=\"117\"><g data-reactid=\"118\"><path class=\"Logo-path\" d=\"M51.731,30.458c1.246,0,2.264,1.425,2.264,3.206l-4.605,0.56C49.517,31.781,50.28,30.458,51.731,30.458 M40.789,8.601 c1.247,0,2.265,1.424,2.265,3.206l-4.606,0.559C38.575,9.924,39.339,8.601,40.789,8.601 M92.774,30.458 c1.247,0,2.264,1.425,2.264,3.206l-4.605,0.56C90.56,31.781,91.323,30.458,92.774,30.458 M128.295,46.463H140v-2.188h-11.705 V46.463z M106.642,31.679c0.279-0.101,0.61-0.178,1.272-0.178c2.544,0,4.275,1.68,4.275,5.42c0,3.104-1.705,5.216-4.173,5.216 c-0.408,0-0.891-0.076-1.374-0.229V31.679z M68.652,33.206h3.18v-4.097c-0.61-0.254-1.017-0.356-1.603-0.356 c-0.992,0-2.188,0.662-3.435,1.654l-0.916,0.713v-2.367h-0.661l-6.285,1.272v0.967l2.112,0.637v10l-2.112,0.763v0.865h9.313v-0.865 l-2.367-0.763v-9.39c0.611-0.229,1.68-0.407,1.934-0.407L68.652,33.206z M80.484,28.753c-3.995,0-8.372,2.494-8.372,8.066 c0,4.631,3.079,6.82,6.412,6.82c1.146,0,2.469-0.153,4.402-0.967l2.341-0.993l-0.33-0.916c-0.586,0.229-1.4,0.382-2.545,0.382 c-3.155,0-5.547-1.883-5.547-6.132c0-2.952,1.705-4.555,3.257-4.555c0.153,0,0.331,0.025,0.484,0.102l1.399,2.646h2.926v-3.613 C83.741,29.16,82.138,28.753,80.484,28.753 M123.792,26.489h-1.171l-5.521,3.18v1.069l1.857,0.178v10.076 c0,1.807,1.425,2.647,3.435,2.647c0.942,0,2.697-0.306,3.868-0.662l2.035-0.636l-0.178-0.941c-0.33,0.076-0.916,0.101-1.246,0.101 c-1.858,0-3.079-0.687-3.079-2.977v-7.455l4.351,0.229v-2.163h-4.351V26.489z M40.128,26.489h-1.171l-5.522,3.18v1.069l1.858,0.178 v10.076c0,1.807,1.425,2.647,3.435,2.647c0.942,0,2.697-0.306,3.868-0.662l2.035-0.636L44.453,41.4 c-0.331,0.076-0.916,0.101-1.247,0.101c-1.857,0-3.078-0.687-3.078-2.977v-7.455l4.351,0.229v-2.163h-4.351V26.489z M58.626,35.293 v-0.407c0-2.189-1.476-6.133-6.437-6.133c-3.766,0-7.558,2.494-7.558,8.066c0,4.555,3.334,6.82,6.972,6.82 
c1.069,0,2.621-0.178,4.428-0.967l2.264-0.993l-0.33-0.916c-0.586,0.229-1.4,0.382-2.596,0.382c-2.799,0-5.878-1.628-6.005-5.852 H58.626z M99.67,35.293v-0.407c0-2.189-1.476-6.133-6.438-6.133c-3.766,0-7.557,2.494-7.557,8.066c0,4.555,3.333,6.82,6.972,6.82 c1.068,0,2.62-0.178,4.427-0.967l2.265-0.993l-0.331-0.916c-0.585,0.229-1.4,0.382-2.596,0.382c-2.798,0-5.877-1.628-6.005-5.852 H99.67z M47.685,13.435v-0.407c0-2.188-1.476-6.132-6.438-6.132c-3.766,0-7.557,2.493-7.557,8.066c0,4.555,3.333,6.819,6.972,6.819 c1.069,0,2.621-0.178,4.427-0.967l2.265-0.992l-0.331-0.916c-0.585,0.229-1.399,0.382-2.595,0.382 c-2.799,0-5.878-1.629-6.005-5.853H47.685z M6.438,25.598v15.725L3.97,42.265v0.992h10.865v-0.992l-2.468-0.942V25.598l2.468-0.942 v-0.992H3.97v0.992L6.438,25.598z M31.781,41.629V33.74c0-2.926-1.094-4.987-4.045-4.987c-1.222,0-2.138,0.382-3.334,0.916 l-2.163,0.942v-1.858h-0.661l-6.285,1.272v0.967l2.112,0.637v10l-2.112,0.763v0.865h8.804v-0.865l-1.858-0.763v-9.924 c0.458-0.127,1.247-0.204,1.833-0.204c1.577,0,2.875,0.865,2.875,3.461v6.667l-1.858,0.763v0.865h8.804v-0.865L31.781,41.629z M106.642,28.753h-0.662l-6.285,1.272v0.967l2.112,0.637v15.598l-2.112,0.763v0.865h9.567V47.99l-2.62-0.763v-3.588h0.992 c5.954,0,9.287-3.614,9.287-8.626c0-4.301-3.002-6.26-5.241-6.26c-0.891,0-2.443,0.535-3.868,1.247l-0.967,0.483h-0.203V28.753z M31.527,19.771v-7.888c0-2.926-1.094-4.987-4.046-4.987c-1.221,0-2.137,0.381-3.333,0.916l-2.163,0.941V0h-0.662l-4.911,1.807H0 v4.911h2.316L3.257,3.69h3.181v15.776L3.97,20.407V21.4h10.61v-0.993l-2.213-0.941V3.69h4.783v16.081l-1.857,0.763V21.4h8.55 v-0.866l-1.858-0.763V9.873c0.458-0.127,1.247-0.229,1.832-0.229c1.578,0,2.875,0.865,2.875,3.46v6.667l-1.857,0.763V21.4h8.804 v-0.866L31.527,19.771z\" fill=\"#fff\" data-reactid=\"119\"></path></g></svg></a></div></div><div class=\"GridContainer\" data-reactid=\"120\"><div class=\"GridRow\" data-reactid=\"121\"><div class=\"ErrorPage-container\" data-reactid=\"122\"><h2 class=\"ErrorPage-pagetitle\" data-reactid=\"123\">Error 
404</h2><h1 class=\"ErrorPage-title\" data-reactid=\"124\">Page not found</h1><p class=\"ErrorPage-text\" data-reactid=\"125\"><!-- react-text: 126 -->We couldn\u2019t find anything at this address. Please check the URL or go to the <!-- /react-text --><a href=\"/\" data-reactid=\"127\">homepage</a><!-- react-text: 128 -->.<!-- /react-text --></p></div></div></div></div><!-- react-empty: 129 --><div style=\"display:none;\" data-reactid=\"130\"><svg\n  xmlns=\"http://www.w3.org/2000/svg\"\n  xmlns:xlink=\"http://www.w3.org/1999/xlink\"\n  height=\"500\"\n  width=\"500\"\n  viewBox=\"0 0 500 500\"\n  aria-labelledby=\"title desc\"\n>\n  <title id=\"title\">Filters SVG</title>\n  <defs>\n    <filter id=\"bleed\" filterUnits=\"objectBoundingBox\">\n      <feColorMatrix\n        type=\"matrix\"\n        values=\"1 0 0 0 0, 0 0.15 0 0 0, 0 0 .20 0 0, 0 0 0 1 0\"\n      />\n    </filter>\n  </defs>\n</svg></div><div class=\"ThirdPartySlot\" id=\"third-party--viewport-top\" data-reactid=\"131\"></div><div class=\"ThirdPartySlot\" id=\"third-party--viewport-takeover\" data-reactid=\"132\"></div><div class=\"ThirdPartySlot\" id=\"third-party--viewport-bottom\" data-reactid=\"133\"></div></div></div>\n    <script>\n      window.initialStoreTree = {\"bodyClasses\":[],\"categoryPostIDs\":{},\"commentsExpanded\":{},\"contentLanguage\":null,\"dispatcher\":{\"backend\":null,\"node\":null,\"type\":null},\"documentCloud\":{\"document\":{},\"embedUrl\":\"\",\"text\":{}},\"documentIDs\":[],\"error\":{\"message\":\"Page not 
found\",\"status\":404},\"featureIDs\":[],\"featuresLanguage\":{},\"googleAMPUrl\":\"\",\"hamburgerColor\":null,\"host\":\"theintercept.com\",\"inInitialRender\":false,\"languageLanding\":{},\"liveBlogsUpdatesIDs\":{},\"loading\":false,\"mediaPlayer\":null,\"newsletter\":{\"form\":{\"description\":\"\",\"status\":\"\"}},\"podcastPage\":{},\"podcastsHomepage\":{\"speakingIDs\":[]},\"postLanding\":{\"redirect\":null},\"postsMetaIDs\":{},\"resources\":{\"alerts\":{},\"annotationSets\":{},\"annotations\":{},\"categories\":{},\"comments\":{},\"documents\":{},\"liveBlogs\":{},\"platform\":{\"theintercept\":{\"Article\":{},\"Author\":{},\"Document\":{},\"GeoLocation\":{},\"HttpReturn\":{},\"Podcast\":{},\"PodcastEpisode\":{},\"Section\":{},\"bySpeakingID\":{},\"documentArchives\":{},\"documentReleases\":{},\"nodePromos\":{\"default\":{}}},\"theintercept-brasil\":{\"HttpReturn\":{},\"bySpeakingID\":{},\"latestPromos\":[],\"nodePromos\":{\"default\":{}}}},\"postCommentMeta\":{},\"posts\":{},\"promoBanners\":{},\"series\":{},\"seriesDocuments\":{},\"staff\":{},\"taxonomies\":{},\"timeline\":{}},\"reverseChronIDs\":{},\"route\":{\"names\":[],\"params\":{},\"path\":\"\",\"pathname\":\"\",\"query\":{}},\"routed\":false,\"scrollToPostComments\":null,\"searchResultIDs\":[],\"seriesHomepage\":{\"curatedItems\":{\"postIds\":[],\"seriesSlugs\":[]},\"recentItems\":{\"postIds\":[],\"seriesSlugs\":[]}},\"seriesPostIDs\":{},\"sidToday\":{\"search\":{\"lastCursor\":null,\"loading\":false,\"speakingIDs\":[],\"totalCount\":null}},\"sidTodayFilesUpdateReports\":{\"category\":{}},\"specialSeriesItems\":[],\"squirrelDocumentIDs\":[],\"squirrelIDs\":[],\"staffIDs\":[],\"surveillanceCatalogData\":null,\"surveillanceCatalogVendors\":null,\"tocChapters\":{},\"tracking\":{\"currentUrl\":\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\",\"previousUrl\":null}};\n      window.config = 
{\"assets\":{\"host\":\"\",\"webpack\":false},\"aws_static\":\"https://static.theintercept.com\",\"coral_talk_api_origin\":\"https://talk.theintercept.com\",\"coral_talk_origin\":\"https://talk.theintercept.com\",\"coral_talk_permalink_cutover_date\":\"2019-06-24T14:00:00.000Z\",\"donation_base_url\":\"https://join.theintercept.com/donate/\",\"donation_base_url_brasil\":\"http://catarse.me/intercept/\",\"env\":{\"NODE_ENV\":\"production\"},\"facebook\":{\"tracking_pixel_id\":\"2151258874911575\"},\"google\":{\"id\":\"UA-79475609-15\"},\"graphql_realm_id\":\"UmVhbG06NTJiOWMwOGEtMjQwYS00NzMxLThlYTAtMjMyY2RiYTYwNzBh\",\"graphql_realm_id__brasil\":\"UmVhbG1Db250ZW50OjE3NTU4MzMyLWUwZTQtNDIyYy1iNDcyLWZkMDQzMmRiOGRhYw==\",\"graphql_url\":\"http://read.usq.flmcloud.local:3002/graphql\",\"hash\":\"42e762a729b53f810f04\",\"host\":\"theintercept.com\",\"imgix\":{\"additional_origins\":[\"https://firstlook.org\"],\"domain\":\"theintercept.imgix.net\"},\"logs\":{\"level\":\"info\"},\"onsite_origins\":[\"https://theintercept.com\"],\"origin\":\"https://theintercept.com\",\"override_private_wp_host\":\"theintercept.com\",\"parsely\":{\"endpoint\":\"https://c.prod.theintercept.com/a\",\"site_id\":\"theintercept.com\"},\"piano\":{\"application_id\":\"hsZyoAWmIE\",\"origin\":\"https://o.prod.theintercept.com\"},\"port\":8080,\"private_wp_origin\":\"https://wp.theintercept.com\",\"public_api_origin\":\"https://theintercept.com\",\"public_wp_origin\":\"https://theintercept.com\",\"request_timeout\":30000,\"set_headers\":{\"Access-Control-Allow-Origin\":\"*\",\"Referrer-Policy\":\"strict-origin-when-cross-origin\",\"Strict-Transport-Security\":\"max-age=63072000; includeSubDomains; preload\",\"Vary\":\"Accept-Encoding\",\"X-Content-Type-Options\":\"nosniff\",\"X-Frame-Options\":\"SAMEORIGIN\",\"X-Xss-Protection\":\"1; mode=block\"},\"site_prefix\":\"\"};\n      window.__COUNTRY_CODE__ = \"US\";\n    </script>\n    \n      \n        <script type=\"application/ld+json\">\n          
{\"url\":\"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\"}\n        </script>\n      \n      <div id=\"parsely-root\" style=\"display: none\">\n        <div id=\"parsely-cfg\" data-parsely-site=\"theintercept.com\"></div>\n      </div>\n    \n    <script src=\"/assets/app42e762a729b53f810f04.js\"></script>\n  </body>\n</html>\n"

Solution

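  • Reading the file inside the with block and building the soup afterwards is fine, because html_text already holds the whole file by then, so the soup is not empty. It is still tidier to read and parse in one step.

    The real issue is that the file holds escaped HTML: every quote is written as \", so the parser reads the rel value as \"canonical\" (backslashes and quotes included) rather than canonical, and an exact match on 'canonical' finds nothing. Either match that literal value explicitly or use the *= (substring) operator in a CSS selector.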

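    from bs4 import BeautifulSoup


    # Read and parse in one step ('lxml' requires the lxml package).
    with open('a.html') as f:
        soup = BeautifulSoup(f.read(), 'lxml')

    # *= matches any rel value that contains "canonical",
    # including the escaped form \"canonical\".
    for i in soup.select('link[rel*=canonical]'):
        print(i['href'])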
    

    Output:

    \"https://theintercept.com/2020/10/27/senator-perdue-ossoff-china/feed/\"/
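
Note that the href printed above still carries the escaped quotes and the trailing slash of the tag. The posted content looks like a single JSON string literal (it starts and ends with a double quote and uses \n, \" and \uXXXX escapes), so another option may be to decode the file first and then search with an exact rel match. This is only a minimal sketch under that assumption; find_canonical and the sample string are hypothetical names made up for illustration and are not part of the question or of BeautifulSoup:

```python
import json

from bs4 import BeautifulSoup


def find_canonical(raw_text):
    """Return the canonical URL found in raw_text, or None if there is none."""
    try:
        # Assumption: the file is one JSON string literal.
        # strict=False tolerates literal control characters (e.g. newlines).
        decoded = json.loads(raw_text, strict=False)
    except json.JSONDecodeError:
        decoded = raw_text  # not JSON: treat the text as plain HTML
    if not isinstance(decoded, str):
        decoded = raw_text  # JSON, but not a string: fall back to the raw text

    soup = BeautifulSoup(decoded, 'html.parser')
    link = soup.find('link', rel='canonical')
    # find returns None when nothing matches, so guard before reading href.
    return link.get('href') if link is not None else None


# Hypothetical sample that mimics the escaping seen in the posted file.
sample = r'"<html><head><link rel=\"canonical\" href=\"https://example.com/a/\"/></head></html>"'
print(find_canonical(sample))  # https://example.com/a/
```

If a file turns out not to be valid JSON, the function parses the raw text as it is. In that case an exact match on 'canonical' will still miss escaped attributes, and the rel*= selector shown earlier remains the safer choice.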