
Why does this url raise BadStatusLine with httplib2 and urllib2?


I'm trying to fetch pages from this URL with both httplib2 and urllib2, but every attempt fails with this exception:

content = conn.request(uri="http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 1129, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 901, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/lib/python2.7/dist-packages/httplib2/__init__.py", line 871, in _conn_request
    response = conn.getresponse()
  File "/usr/lib/python2.7/httplib.py", line 1027, in getresponse
    response.begin()
  File "/usr/lib/python2.7/httplib.py", line 407, in begin
    version, status, reason = self._read_status()
  File "/usr/lib/python2.7/httplib.py", line 371, in _read_status
    raise BadStatusLine(line)

The HTTP headers looked like this:

http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902

GET /news/news_print.asp?artice_id=20110727092902 HTTP/1.1
Host: www.zdnet.co.kr
User-Agent: Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: ko-kr,ko;q=0.8,en-us;q=0.5,en;q=0.3
Accept-Encoding: gzip, deflate
Connection: keep-alive
Cookie: RMID=7d83495d4f336fe0; __utma=37206251.1552605885.1328771258.1328771258.1329070845.2; __utmz=37206251.1328771258.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); ASPSESSIONIDCSQCQTDD=BCLEHPPDEPHEBJDLCFNDMKDN; __utmc=37206251; ASPSESSIONIDSSQCQQCB=MJPLMOJAFPDFCLONCANBIKHN; _EXEN=2
X-FireLogger: 1.2

HTTP/1.1 200 OK
Date: Mon, 13 Feb 2012 18:02:56 GMT
Content-Length: 19158
Content-Type: text/html;charset=UTF-8; Charset=UTF-8
Set-Cookie: ASPSESSIONIDSQSDQRDB=NGAIFHKAGDIOGEMANAOLLKKF; path=/
Cache-Control: private

Any clue?


Solution

  • This works fine for me:

    import urllib2
    
    opener = urllib2.build_opener()
    
    # Send a browser-like User-Agent instead of the library default.
    headers = {
      'User-Agent': 'Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1',
    }
    
    # addheaders expects a list of (name, value) tuples.
    opener.addheaders = headers.items()
    response = opener.open("http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902")
    
    print response.headers
    print response.read()
    

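    The website discards requests that arrive without a browser-like User-Agent string. When it drops the connection without replying, httplib has no status line to parse and raises BadStatusLine.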
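
For anyone on Python 3, the same idea can be sketched roughly as follows. This assumes Python 3, where urllib2 became urllib.request and httplib became http.client. The names build_request, fetch and BROWSER_UA are made up for illustration and are not part of any library. It is a minimal sketch, not a tested fix for this particular site.

```python
# Python 3 sketch: attach an explicit User-Agent and handle BadStatusLine.
import http.client
import urllib.request

# User-Agent string copied from the browser request shown in the question.
BROWSER_UA = "Mozilla/5.0 (Windows NT 5.1; rv:10.0.1) Gecko/20100101 Firefox/10.0.1"


def build_request(url, user_agent=BROWSER_UA):
    """Return a Request object that carries an explicit User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


def fetch(url, timeout=10):
    """Fetch url and return the body as bytes, or None on a bad status line."""
    try:
        with urllib.request.urlopen(build_request(url), timeout=timeout) as response:
            return response.read()
    except http.client.BadStatusLine:
        # Raised when the server sends an empty or malformed status line,
        # for example when it closes the connection without replying.
        return None
```

Calling fetch("http://www.zdnet.co.kr/news/news_print.asp?artice_id=20110727092902") would then return the page bytes, or None if the server still refuses to answer. Returning None is only one possible design; re-raising the exception may suit scripts that need to distinguish failure modes.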