This is a follow-up to a question I saw earlier today, in which a user asks about a problem downloading a PDF from this URL:
http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009
I would have thought the two download functions below would give the same result, but the urllib2 version downloads some HTML with a script tag referencing a PDF loader, while the requests version downloads the real PDF. Can someone explain the difference in behavior?
import urllib2
import requests

def get_pdf_urllib2(url, outfile='ex.pdf'):
    resp = urllib2.urlopen(url)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
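One quick way to see what each function actually saved is to check the file's leading bytes (looks_like_pdf is my own diagnostic helper, not part of either library): a real PDF starts with the magic bytes %PDF, while the loader page is HTML.

```python
def looks_like_pdf(data):
    # A real PDF begins with the magic bytes b'%PDF';
    # the loader page urllib2 gets back is HTML instead.
    return data[:4] == b'%PDF'

print(looks_like_pdf(b'%PDF-1.4 example'))  # True
print(looks_like_pdf(b'<html><head>'))      # False
```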
Is requests smart enough to wait for dynamic websites to render before downloading?
Edit

Following up on @cwallenpoole's idea, I compared the headers and tried swapping headers from the requests request into the urllib2 request. The magic header was Cookie; the functions below write the same file for the example URL.
def get_pdf_urllib2(url, outfile='ex.pdf'):
    req = urllib2.Request(url, headers={'Cookie': 'I2KBRCK=1'})
    resp = urllib2.urlopen(req)
    with open(outfile, 'wb') as f:
        f.write(resp.read())

def get_pdf_requests(url, outfile='ex.pdf'):
    resp = requests.get(url)
    with open(outfile, 'wb') as f:
        f.write(resp.content)
Next question: where did requests get that cookie? Is requests making multiple trips to the server?
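One way to answer that empirically: requests keeps every intermediate redirect response on resp.history, so you can print each hop and any Set-Cookie header it carried. A sketch (trace_redirects is my own helper; the call against the example URL is left commented out):

```python
import requests

def trace_redirects(url):
    # requests follows redirects automatically; each intermediate
    # response is kept on resp.history, so a Set-Cookie issued by
    # a 302 along the way is visible after the fact.
    resp = requests.get(url)
    for hop in resp.history:
        print(hop.status_code, hop.headers.get('Set-Cookie'))
    return resp

# trace_redirects('http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009')
```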
Edit 2

The cookie came from a redirect header:
>>> handler=urllib2.HTTPHandler(debuglevel=1)
>>> opener=urllib2.build_opener(handler)
>>> urllib2.install_opener(opener)
>>> respurl=urllib2.urlopen(req1)
send: 'GET /doi/pdf/10.1177/0956797614553009 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: P3P: CP="NOI DSP ADM OUR IND OTC"
header: Location: http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009?cookieSet=1
header: Set-Cookie: I2KBRCK=1; path=/; expires=Thu, 14-Dec-2017 17:28:28 GMT
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 110
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /doi/pdf/10.1177/0956797614553009?cookieSet=1 HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 302 Found\r\n'
header: Server: AtyponWS/7.1
header: Location: http://journals.sagepub.com/action/cookieAbsent
header: Content-Type: text/html; charset=utf-8
header: Content-Length: 85
header: Connection: close
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
send: 'GET /action/cookieAbsent HTTP/1.1\r\nAccept-Encoding: identity\r\nHost: journals.sagepub.com\r\nConnection: close\r\nUser-Agent: Python-urllib/2.7\r\n\r\n'
reply: 'HTTP/1.1 200 OK\r\n'
header: Server: AtyponWS/7.1
header: Cache-Control: no-cache
header: Pragma: no-cache
header: X-Webstats-RespID: 8344872279f77f45555d5f9aeb97985b
header: Set-Cookie: JSESSIONID=aaavQMGH8mvlh_-5Ct7Jv; path=/
header: Content-Type: text/html; charset=UTF-8
header: Connection: close
header: Transfer-Encoding: chunked
header: Date: Wed, 14 Dec 2016 17:28:28 GMT
header: Vary: Accept-Encoding
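So urllib2 drops the Set-Cookie from the first 302 and ends up at /action/cookieAbsent. It can do the same cookie round-trip automatically if you give it a cookie jar; a sketch using the Python 3 names (urllib.request and http.cookiejar; in Python 2 these are urllib2 and cookielib):

```python
import http.cookiejar
import urllib.request

def make_cookie_opener():
    # An opener with HTTPCookieProcessor stores cookies set by
    # intermediate responses (like I2KBRCK=1 on the 302 above) and
    # replays them on the follow-up request, which is what requests
    # does behind the scenes.
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

opener, jar = make_cookie_opener()
# opener.open(url) should now follow the redirect with the cookie
# attached instead of looping through /action/cookieAbsent.
```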
I'll bet that it's an issue with the User-Agent header (I just used curl http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009 and got the same as you report with urllib2). This is part of the request header that lets a website know what type of program/user/whatever is accessing the site (not the library, the HTTP request).
By default, urllib2 uses Python-urllib/x.y, where x.y tracks the Python version (your debug log above shows Python-urllib/2.7). And requests uses: python-requests/{package version} {runtime}/{runtime version} {uname}/{uname -r}
If you're working on a Mac, I'll bet that the site is reading Darwin/13.1.0 or similar and then serving you the macOS-appropriate content. Otherwise, it's probably trying to direct you to some default alternate content (or prevent you from scraping that URL).
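If you want to test this theory, you can override the default User-Agent explicitly (Python 3 names shown; in Python 2 it's urllib2.Request; the UA string here is an arbitrary example, not a recommendation):

```python
import urllib.request

url = 'http://journals.sagepub.com/doi/pdf/10.1177/0956797614553009'
# Replace the default Python-urllib/x.y User-Agent with a browser-like
# string; urllib stores header names in capitalized form ('User-agent').
req = urllib.request.Request(
    url, headers={'User-Agent': 'Mozilla/5.0 (example)'})
print(req.get_header('User-agent'))
```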