
python requests: adding "referer" header to redirected requests


I am wondering whether Python's requests library supports something like curl's "autoreferer" option. Basically, with allow_redirects=True, requests should automatically set the "Referer" header on each subsequent redirected request.

Here is what the request headers look like using requests (no "Referer" header is sent):

>>> import requests
>>> import logging
>>> import http.client
>>> http.client.HTTPConnection.debuglevel = 1
>>> logging.basicConfig()
>>> logging.getLogger().setLevel(logging.DEBUG)
>>> requests_log = logging.getLogger("requests.packages.urllib3")
>>> requests_log.setLevel(logging.DEBUG)
>>> requests_log.propagate = True
>>> r = requests.post('http://www.somewebsite.com', allow_redirects=True)
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): www.somewebsite.com:80
send: b'POST / HTTP/1.1\r\nHost: www.somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nContent-Length: 0\r\n\r\n'
reply: 'HTTP/1.1 307 Temporary Redirect\r\n'
DEBUG:urllib3.connectionpool:http://www.somewebsite.com:80 "POST / HTTP/1.1" 307 185
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): www.somewebsite.com:443
header: Server header: Date header: Content-Type header: Content-Length header: Connection header: Location header: X-Cache header: Via header: X-Amz-Cf-Pop header: X-Amz-Cf-Id
send: b'POST / HTTP/1.1\r\nHost: www.somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\nContent-Length: 0\r\n\r\n'
reply: 'HTTP/1.1 302 Moved Temporarily\r\n'
DEBUG:urllib3.connectionpool:https://www.somewebsite.com:443 "POST / HTTP/1.1" 302 13
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): somewebsite.com:443
header: Content-Type header: Content-Length header: Connection header: Date header: Location header: Access-Control-Allow-Origin header: X-Cache header: Via header: X-Amz-Cf-Pop header: X-Amz-Cf-Id
send: b'GET / HTTP/1.1\r\nHost: somewebsite.com\r\nAccept: */*\r\nUser-Agent: python-requests/2.21.0\r\nConnection: keep-alive\r\nAccept-Encoding: gzip, deflate\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
DEBUG:urllib3.connectionpool:https://somewebsite.com:443 "GET / HTTP/1.1" 200 149681
header: Content-Type header: Content-Length header: Connection header: Date header: Server header: Expires header: Last-Modified header: Content-Encoding header: Via header: Vary header: Accept-Ranges header: Cache-Control header: Set-Cookie header: X-Cache header: X-Amz-Cf-Pop header: X-Amz-Cf-Id
>>> 

And here is what the request headers look like using pycurl (with the "Referer" header):

>>> import pycurl
>>> from io import BytesIO
>>> buffer = BytesIO()
>>> c = pycurl.Curl()
>>> c.setopt(c.URL, 'http://www.somewebsite.com/')
>>> c.setopt(c.WRITEDATA, buffer)
>>> c.setopt(pycurl.VERBOSE, 1)
>>> c.setopt(pycurl.AUTOREFERER, 1)
>>> c.setopt(pycurl.FOLLOWLOCATION, 1)
>>> c.perform()
>>> c.close()
*   Trying 99.84.194.56...
* Connected to www.somewebsite.com (99.84.194.56) port 80 (#0)
> GET / HTTP/1.1
Host: www.somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*

< HTTP/1.1 301 Moved Permanently
< Server: CloudFront
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Content-Type: text/html
< Content-Length: 183
< Connection: keep-alive
< Location: https://www.somewebsite.com/
< X-Cache: Redirect from cloudfront
< Via: 1.1 40ddfb9607f5d49c286c41e9afdce772.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LAX3-C3
< X-Amz-Cf-Id: Uij3cpBtl0ZJ_OwFFDSint5ab3Ayvn0okmhJekgtxI-etIN5l07sjg==
< 
* Ignoring the response-body
* Connection #0 to host www.somewebsite.com left intact
* Issue another request to this URL: 'https://www.somewebsite.com/'
* Found bundle for host www.somewebsite.com: 0x2ab53b0 [can pipeline]
*   Trying 99.84.194.113...
* Connected to www.somewebsite.com (99.84.194.113) port 443 (#1)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*    subject: CN=watchdisneyfe.com
*    start date: Dec 16 00:00:00 2019 GMT
*    expire date: Jan 16 12:00:00 2021 GMT
*    subjectAltName: www.somewebsite.com matched
*    issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*    SSL certificate verify ok.
> GET / HTTP/1.1
Host: www.somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*
Referer: http://www.somewebsite.com/

< HTTP/1.1 302 Moved Temporarily
< Content-Type: text/plain
< Content-Length: 13
< Connection: keep-alive
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Location: https://somewebsite.com/
< Access-Control-Allow-Origin: *
< X-Cache: Miss from cloudfront
< Via: 1.1 74d35431a23bfc97a6055173d9be2dc4.cloudfront.net (CloudFront)
< X-Amz-Cf-Pop: LAX3-C3
< X-Amz-Cf-Id: Bxg1W9zPN7U4i8GqysA11vj6h2dyDZdClyMUfUMfVUqd-v_mrQXGhQ==
< 
* Ignoring the response-body
* Connection #1 to host www.somewebsite.com left intact
* Issue another request to this URL: 'https://somewebsite.com/'
*   Trying 13.225.146.93...
* Connected to somewebsite.com (13.225.146.93) port 443 (#2)
* ALPN, offering http/1.1
* Cipher selection: ALL:!EXPORT:!EXPORT40:!EXPORT56:!aNULL:!LOW:!RC4:@STRENGTH
* successfully set certificate verify locations:
*   CAfile: /etc/ssl/certs/ca-certificates.crt
  CApath: /etc/ssl/certs
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN, server accepted to use http/1.1
* Server certificate:
*    subject: CN=watchdisneyfe.com
*    start date: Dec 16 00:00:00 2019 GMT
*    expire date: Jan 16 12:00:00 2021 GMT
*    subjectAltName: somewebsite.com matched
*    issuer: C=US; O=Amazon; OU=Server CA 1B; CN=Amazon
*    SSL certificate verify ok.
> GET / HTTP/1.1
Host: somewebsite.com
User-Agent: PycURL/7.43.0.2 libcurl/7.47.0 OpenSSL/1.0.2g zlib/1.2.8 libidn/1.32 librtmp/2.3
Accept: */*
Referer: https://www.somewebsite.com/

< HTTP/1.1 200 OK
< Content-Type: text/html; charset=utf-8
< Content-Length: 1218349
< Connection: keep-alive
< Vary: Accept-Encoding
< Date: Wed, 26 Feb 2020 21:46:55 GMT
< Server: nginx/1.16.1
< Expires: Wed, 26 Feb 2020 21:56:48 GMT
< Last-Modified: Wed, 26 Feb 2020 21:56:48 GMT
< Via: 1.1 varnish-v4, 1.1 a52dcb1fed052adbd58b868375961d24.cloudfront.net (CloudFront)
< Vary: Accept-Encoding
< Accept-Ranges: bytes
< Cache-Control: max-age=0, must-revalidate
< Set-Cookie: SWID=72B09DFD-D038-485C-C836-7229EB59F0B1; path=/; Expires=Sun, 26 Feb 2040 21:46:55 GMT; domain=somewebsite.com;
< X-Cache: Miss from cloudfront
< X-Amz-Cf-Pop: LAX3-C4
< X-Amz-Cf-Id: JGF1k-OnDIZT_1DP5psnrlb9jmmp7rq69QbGNZL1CVGbjJWjORwpGQ==
< 
* Connection #2 to host somewebsite.com left intact

Is there any way to add the "Referer" header automatically, as curl does?

Note: if you want to try it out, replace "somewebsite" with "abc", for instance.


Solution

  • requests doesn't have any official hooks for this task. But you could subclass requests.Session to wrap a method that's called for each redirect: Session.rebuild_auth():

    When being redirected we may want to strip authentication from the request to avoid leaking credentials. This method intelligently removes and reapplies authentication where possible to avoid credential loss.

    Because it is called with the next (prepared) request as well as the previous response that triggered the redirect, it is ideally situated to add the Referer header:

    import requests
    
    class RefererSession(requests.Session):
        def rebuild_auth(self, prepared_request, response):
            super().rebuild_auth(prepared_request, response)
            prepared_request.headers["Referer"] = response.url
    

    Then use this subclass for all requests:

    with RefererSession() as session:
        r = session.post('http://www.somewebsite.com', allow_redirects=True)
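
The subclass above copies response.url into the header verbatim. As a hedged variation, the sketch below sanitizes the value first: it drops any user:password@ part and any #fragment, and sends no Referer at all when an https URL redirects to a plain http one, which is what the HTTP specification (RFC 7231, section 5.5.2) asks of user agents. The names build_referer and SafeRefererSession are made up for this illustration and are not part of requests; whether curl applies the same rules depends on the curl version.

```python
from urllib.parse import urlsplit, urlunsplit

import requests


def build_referer(previous_url, next_url):
    """Return a Referer value for next_url, or None if none should be sent.

    Hypothetical helper for illustration; not part of requests.
    """
    prev = urlsplit(previous_url)
    nxt = urlsplit(next_url)
    # Do not reveal an https URL to a plain-http destination.
    if prev.scheme == "https" and nxt.scheme != "https":
        return None
    # Rebuild the netloc from host and port only, dropping user:password@.
    host = prev.hostname or ""
    if ":" in host:  # IPv6 literal; urlsplit strips the brackets
        host = f"[{host}]"
    netloc = host if prev.port is None else f"{host}:{prev.port}"
    # The empty last component drops the #fragment.
    return urlunsplit((prev.scheme, netloc, prev.path or "/", prev.query, ""))


class SafeRefererSession(requests.Session):
    """Assumed variant of RefererSession that sanitizes the Referer value."""

    def rebuild_auth(self, prepared_request, response):
        super().rebuild_auth(prepared_request, response)
        referer = build_referer(response.url, prepared_request.url)
        if referer is None:
            # The redirected request is a copy of the previous one, so an
            # earlier Referer may still be present; remove it.
            prepared_request.headers.pop("Referer", None)
        else:
            prepared_request.headers["Referer"] = referer
```

SafeRefererSession is used exactly like RefererSession; only the header value differs.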
    

    Demo using https://httpbin.org:

    >>> import requests
    >>> import http.client
    >>> from ast import literal_eval
    >>> def echo_request_lines(msg, *rest):
    ...     """HTTPConnection debug print handler, writes out request lines"""
    ...     if msg != 'send:': return
    ...     request_lines = literal_eval(rest[0]).replace(b'\r', b'')
    ...     print(request_lines.rstrip().decode('latin1'))
    ...     print()
    ...
    >>> http.client.HTTPConnection.debuglevel = 1
    >>> http.client.print = echo_request_lines
    >>> class RefererSession(requests.Session):
    ...     def rebuild_auth(self, prepared_request, response):
    ...         super().rebuild_auth(prepared_request, response)
    ...         prepared_request.headers["Referer"] = response.url
    ...
    >>> with RefererSession() as session:
    ...     r = session.get('https://httpbin.org/redirect/2')
    ...
    GET /redirect/2 HTTP/1.1
    Host: httpbin.org
    User-Agent: python-requests/2.22.0
    Accept-Encoding: gzip, deflate
    Accept: */*
    Connection: keep-alive
    
    GET /relative-redirect/1 HTTP/1.1
    Host: httpbin.org
    User-Agent: python-requests/2.22.0
    Accept-Encoding: gzip, deflate
    Accept: */*
    Connection: keep-alive
    Referer: https://httpbin.org/redirect/2
    
    GET /get HTTP/1.1
    Host: httpbin.org
    User-Agent: python-requests/2.22.0
    Accept-Encoding: gzip, deflate
    Accept: */*
    Connection: keep-alive
    Referer: https://httpbin.org/relative-redirect/1
    
    >>> from pprint import pprint
    >>> pprint(dict(r.history[1].request.headers))
    {'Accept': '*/*',
     'Accept-Encoding': 'gzip, deflate',
     'Connection': 'keep-alive',
     'Referer': 'https://httpbin.org/redirect/2',
     'User-Agent': 'python-requests/2.22.0'}
    >>> pprint(dict(r.request.headers))
    {'Accept': '*/*',
     'Accept-Encoding': 'gzip, deflate',
     'Connection': 'keep-alive',
     'Referer': 'https://httpbin.org/relative-redirect/1',
     'User-Agent': 'python-requests/2.22.0'}
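
If you would rather not depend on httpbin.org, the same check can probably be reproduced against a throwaway local server. The following is a minimal sketch under stated assumptions: RedirectHandler, its /start, /middle and /end paths, and run_demo are invented for this example, and the server is built on the standard library's http.server. It records the Referer header the server receives for each request in the redirect chain.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

import requests


class RefererSession(requests.Session):
    """Same subclass as in the answer: Referer is the URL that redirected us."""

    def rebuild_auth(self, prepared_request, response):
        super().rebuild_auth(prepared_request, response)
        prepared_request.headers["Referer"] = response.url


class RedirectHandler(BaseHTTPRequestHandler):
    """Hypothetical test server: /start -> /middle -> /end."""

    redirects = {"/start": "/middle", "/middle": "/end"}
    seen = []  # (path, Referer header or None) for every request received

    def do_GET(self):
        self.seen.append((self.path, self.headers.get("Referer")))
        target = self.redirects.get(self.path)
        if target is not None:
            self.send_response(302)
            self.send_header("Location", target)
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"done"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the server quiet


def run_demo():
    """Follow the redirect chain once and return what the server saw."""
    RedirectHandler.seen.clear()
    server = HTTPServer(("127.0.0.1", 0), RedirectHandler)  # port 0: any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    base = f"http://127.0.0.1:{server.server_port}"
    try:
        with RefererSession() as session:
            session.trust_env = False  # ignore proxy settings for the loopback call
            response = session.get(f"{base}/start")
    finally:
        server.shutdown()
        server.server_close()
    return base, response, list(RedirectHandler.seen)


if __name__ == "__main__":
    base, response, seen = run_demo()
    for path, referer in seen:
        print(path, referer)
```

The first request should arrive without a Referer, and each redirected request should carry the URL of the response that triggered it.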