urllib3 download a file using specified user agent


What is the correct way to update the user agent information in urllib3?

How can I check that the user agent information was indeed changed and is being used?

For example:

import urllib3

user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0'}
http = urllib3.PoolManager(10, headers=user_agent)

r1 = http.request('GET', 'http://example.com/')
if r1.status == 200:  # compare by value; `is` tests object identity
    with open('somefile', 'wb') as f:  # r1.data is bytes, so open in binary mode
        f.write(r1.data)

After creating the PoolManager as http, I inspected it with dir(http) and saw that http.headers, which is empty by default, now holds the user agent I specified. But is it actually being used? Is there any way to check without having to look at the Apache logs?

And this is what /var/log/apache2/access.log shows after trying to update the user agent:

>>> import urllib3
>>> user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0'}
>>> http = urllib3.PoolManager(2, headers=user_agent)
>>> r = http.request('GET','localhost')
>>> with open('/var/log/apache2/access.log','r') as f:
...     last_line = f.readlines()[-1]
... 
>>> last_line
'127.0.0.1 - - [08/Dec/2014:20:42:04 -0500] "GET / HTTP/1.1" 200 461 "-" "-"\n'

Solution

  • The keyword argument should be headers, not header:

    http = urllib3.PoolManager(10, headers=user_agent)
    

    You can confirm that headers were set correctly using sites like httpbin.org:

    >>> import urllib3
    >>> user_agent = {'user-agent': 'Mozilla/5.0 (Windows NT 6.3; rv:36.0) ..'}
    >>> http = urllib3.PoolManager(10, headers=user_agent)
    >>> r1 = http.urlopen('GET', 'http://httpbin.org/headers')
    >>> print(r1.data)
    {
      "headers": {
        "Accept-Encoding": "identity",
        "Connect-Time": "1",
        "Connection": "close",
        "Host": "httpbin.org",
        "Total-Route-Time": "0",
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0",
        "Via": "1.1 vegur",
        "X-Request-Id": "5ef53f21-6caf-4e45-8123-98e417cd05ba"
      }
    }
    

    Alternatively, you can inspect the outgoing request with a packet analyzer (e.g. Wireshark).
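
If you would rather not depend on an external site or on server logs, a throwaway local server can play the same role as httpbin.org. The following is only a minimal sketch, assuming Python 3 with urllib3 installed; EchoHeadersHandler and fetch_echoed_headers are hypothetical names made up for this illustration, not part of urllib3.

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import urllib3


class EchoHeadersHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: replies with the request headers it received, as JSON."""

    def do_GET(self):
        # Lower-case the header names so the caller can look them up predictably.
        received = {name.lower(): value for name, value in self.headers.items()}
        body = json.dumps({"headers": received}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # suppress the default per-request logging on stderr


def fetch_echoed_headers(user_agent):
    """Start a local echo server, request it via urllib3, return the headers it saw."""
    # Port 0 asks the OS for any free port.
    server = HTTPServer(("127.0.0.1", 0), EchoHeadersHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        http = urllib3.PoolManager(10, headers={"user-agent": user_agent})
        url = "http://127.0.0.1:%d/" % server.server_address[1]
        response = http.request("GET", url)
        return json.loads(response.data.decode("utf-8"))["headers"]
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    ua = "Mozilla/5.0 (Windows NT 6.3; rv:36.0) Gecko/20100101 Firefox/36.0"
    seen = fetch_echoed_headers(ua)
    print(seen.get("user-agent"))
```

If the printed value matches the string passed to PoolManager, the header is being sent on the wire and not just stored on the object.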