python, beautifulsoup, python-requests-html

urllib.error.HTTPError: HTTP Error 404: Not Found even though I can go to the link?


import requests
from bs4 import BeautifulSoup
import wget   # Downloads files from url

page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
soup = BeautifulSoup(page.content, 'html.parser')

for flag in soup.find_all('a', attrs={'class': "image"}):
    src = flag.contents[0]['src']
    src = src.replace("thumb/", "")
    src = "https:" + src
    sep = '.svg'
    fixed_src = src.split(sep, 1)[0] + ".svg"
    print(fixed_src)
    for country in data["Country"]:    # data: a DataFrame with a "Country" column, defined elsewhere
        if country in fixed_src:
            wget.download(fixed_src, f'flags/{country}.svg')

It works for most of the generated URLs, but once it reaches "Australia" it raises urllib.error.HTTPError: HTTP Error 404: Not Found. Yet when I click the link in a browser, it redirects me to the image and it IS found.

I tried adding an if statement to skip Australia, but a few other URLs returned the same error.

Any ideas?


Solution

  • I think your problems are most likely caused by escaped characters in your URLs. Browsers know how to resolve them; the wget library, however, seemingly does not, so you have to get rid of the escaped characters yourself.

    Try passing fixed_src through urllib.parse.unquote() before calling wget.download(). It resolved the 404 problems, at least for me.

    See the difference:

    Before unquoting:

    https://upload.wikimedia.org/wikipedia/commons/7/7a/Flag_of_Afghanistan_%282004%E2%80%932021%29.svg
    

    After unquoting:

    https://upload.wikimedia.org/wikipedia/commons/7/7a/Flag_of_Afghanistan_(2004–2021).svg
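
    The escapes are ordinary percent-encoding: the parentheses become %28/%29 and the en dash becomes %E2%80%93, and unquote simply reverses that. A quick round trip in the standard library:

    >>> from urllib.parse import quote, unquote
    >>> quote("(2004–2021)")
    '%282004%E2%80%932021%29'
    >>> unquote("%282004%E2%80%932021%29")
    '(2004–2021)'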
    

    Full code below:

    import urllib.parse   # note: a bare "import urllib" does not load the parse submodule
    import requests
    from bs4 import BeautifulSoup
    import wget   # Downloads files from url
    
    page = requests.get("https://en.wikipedia.org/wiki/Gallery_of_sovereign_state_flags")
    soup = BeautifulSoup(page.content, 'html.parser')
    
    for flag in soup.find_all('a', attrs={'class': "image"}):
        src = flag.contents[0]['src']
        src = src.replace("thumb/", "")             # full-size image instead of the thumbnail
        src = "https:" + src                        # the scraped src is protocol-relative
        sep = '.svg'
        fixed_src = src.split(sep, 1)[0] + ".svg"   # drop the thumbnail suffix after ".svg"
        print(fixed_src)
        url_unquoted = urllib.parse.unquote(fixed_src)   # e.g. %28 -> "(", %E2%80%93 -> en dash
        print(url_unquoted)
        for country in data["Country"]:    # data: a DataFrame with a "Country" column, defined elsewhere
            if country in url_unquoted:
                wget.download(url_unquoted, f'flags/{country}.svg')   # the flags/ directory must already exist
    

    A similar problem can be found by searching "python wget fails for url" on Google.

    See the urllib documentation for details on urllib.parse.unquote.
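
    As an aside: if the wget package keeps misbehaving on some URLs, the download can also be done with requests alone, which accepts these URLs in either quoted or unquoted form. A minimal sketch (download_file is a hypothetical helper, not part of the code above):

    import requests
    
    def download_file(url, path):
        # Stream the file to disk; raises requests.HTTPError on 404 etc.
        resp = requests.get(url, stream=True)
        resp.raise_for_status()
        with open(path, "wb") as f:
            for chunk in resp.iter_content(chunk_size=8192):
                f.write(chunk)
    
    # Drop-in replacement for the wget.download(...) call in the loop:
    # download_file(url_unquoted, f'flags/{country}.svg')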