Tags: python, http, httprequest, urllib, http-status-code-410

Strange HTTP 410 Gone using Python urllib, not reproducible with wget


I am using urllib in Python 3 to fetch some images from my server:

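import urllib.error
import urllib.request

url = "http://some_url.com/image.jpg"

try:
    resp = urllib.request.urlopen(url)
except urllib.error.HTTPError as err:
    # err.code is the numeric HTTP status, err.reason the status text
    print("code " + str(err.code) + " reason " + err.reason)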

Running the file produces an HTTP 410 Gone error:

 $ python3.6 file.py 

download: http://some_url.com/image.jpg
code 410 reason Gone

Traceback (most recent call last):
  File "file.py", line 32, in <module>
    image = image_from_url(url)

But I know for sure the image is there, since wget returns the image fine:

$ wget http://some_url.com/image.jpg

--2019-10-11 16:24:05--  http://some_url.com/image.jpg
Resolving some_url.com...
Connecting to some_url.com|...|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 127891 (125K) [image/jpeg]
Saving to: 'image.jpg'

Any ideas on what is causing this? Something on the server side? Is there some specific header that should go in the urllib request?

Thank you


Solution

  • urllib request:

    GET /wikipedia/commons/c/c9/Moon.jpg HTTP/1.1
    Accept-Encoding: identity
    Host: upload.wikimedia.org
    User-Agent: Python-urllib/3.6
    Connection: close
    

    wget request:

    GET /wikipedia/commons/c/c9/Moon.jpg HTTP/1.1
    User-Agent: Wget/1.19.4 (linux-gnu)
    Accept: */*
    Accept-Encoding: identity
    Host: upload.wikimedia.org
    Connection: Keep-Alive
    

    Comparing the two, the urllib request has no Accept header, while wget sends Accept: */*. Try adding that header. A bit of research suggests it is fairly common for servers to reject requests that lack it, on the assumption that they come from bots.

    import urllib.request

    # Send the same Accept header that wget sends
    req = urllib.request.Request('some_url', headers={'Accept': '*/*'})
    resp = urllib.request.urlopen(req)
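
To check the theory without touching the real server, here is a minimal sketch that reproduces the symptom locally. Everything in it is an assumption made for illustration: `PickyHandler` is a hypothetical stand-in that answers 410 whenever the Accept header is missing (the actual server's filtering rule is unknown), and `fetch_status` and `demo` are made-up helper names.

```python
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class PickyHandler(BaseHTTPRequestHandler):
    """Hypothetical server: answers 410 unless the client sends an Accept header."""

    def do_GET(self):
        status = 200 if self.headers.get("Accept") else 410
        body = b"image bytes" if status == 200 else b""
        self.send_response(status)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo output quiet


def fetch_status(url, headers=None):
    """Return the HTTP status code for a GET on url, with optional extra headers."""
    req = urllib.request.Request(url, headers=headers or {})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.getcode()
    except urllib.error.HTTPError as err:
        return err.code


def demo():
    """Fetch the same URL without and with the Accept header; return both statuses."""
    server = HTTPServer(("127.0.0.1", 0), PickyHandler)  # port 0: pick a free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = "http://127.0.0.1:%d/image.jpg" % server.server_port
    try:
        return fetch_status(url), fetch_status(url, {"Accept": "*/*"})
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(demo())
```

Against this stand-in server, `demo()` should return `(410, 200)`: the plain urllib request is rejected and the one carrying `Accept: */*` succeeds. If adding Accept does not help with the real server, the other visible difference between the two captured requests is the User-Agent (`Python-urllib/3.6` versus `Wget/1.19.4`), which can be overridden through the same `headers` dictionary.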