Tags: python, urllib2, lxml, python-requests

Python - Replacing urllib2 with Requests using lxml


I'm trying to replace urllib2 with requests in this code, which simply pulls some information from a page. I'm not sure how I should go about switching libraries. This is what I have so far, along with the error. What am I doing wrong?

CODE:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import requests, sys
from lxml import etree
# import urllib2

# UTF8
reload(sys)
sys.setdefaultencoding("utf-8")

# url = 'http://countrycode.org/Germany'
# opener = urllib2.build_opener()
# opener.addheaders = [('User-agent', 'USERAGENT')]
r = requests.get('http://countrycode.org/Germany')
response = r.text
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)

countryCodeXpath = '//*[@id="main_table_blue_2"]/tr[3]/td[2]'
countryCode = tree.xpath(countryCodeXpath)
destCountryCode = countryCode[0].text

print destCountryCode

ERROR:

Traceback (most recent call last):
  File "/home/ubuntu/test.py", line 16, in <module>
    tree = etree.parse(response, htmlparser)
  File "lxml.etree.pyx", line 3196, in lxml.etree.parse (src/lxml/lxml.etree.c:64039)
  File "parser.pxi", line 1549, in lxml.etree._parseDocument (src/lxml/lxml.etree.c:91262)
  File "parser.pxi", line 1578, in lxml.etree._parseDocumentFromURL (src/lxml/lxml.etree.c:91546)
  File "parser.pxi", line 1478, in lxml.etree._parseDocFromFile (src/lxml/lxml.etree.c:90613)
  File "parser.pxi", line 1025, in lxml.etree._BaseParser._parseDocFromFile (src/lxml/lxml.etree.c:87527)
  File "parser.pxi", line 565, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:83101)
  File "parser.pxi", line 656, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:84083)
  File "parser.pxi", line 594, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:83379)
IOError: Error reading file '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<SNIP>

Solution

  • The problem is that you're calling parse with a string that holds the page's HTML (r.text), rather than with something parse can open.

    In the ElementTree API (whether the stdlib versions, the separate module from PyPI, or the lxml implementation), the parse function takes a filename or a file:

    The source can be any of the following:

    • a file name/path
    • a file object
    • a file-like object
    • a URL using the HTTP or FTP protocol

    So, it's trying to open a file named <!DOCTYPE HTML PU…, which of course doesn't exist.

    As the docs say:

    To parse from a string, use the fromstring() function instead.
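
A minimal sketch of the fromstring() route, assuming the page really contains a table with id main_table_blue_2 laid out as the question's XPath expects. SAMPLE_HTML and extract_country_code are hypothetical names used only for illustration; the inline markup stands in for the downloaded page so the example runs without network access.

```python
from lxml import etree

# Hypothetical stand-in for the downloaded page; the real markup may differ.
SAMPLE_HTML = """
<html><body>
<table id="main_table_blue_2">
  <tr><td>Country</td><td>Germany</td></tr>
  <tr><td>ISO codes</td><td>DE / DEU</td></tr>
  <tr><td>Country code</td><td>49</td></tr>
</table>
</body></html>
"""


def extract_country_code(html_text):
    """Parse HTML held in memory and return the cell the question's XPath selects."""
    parser = etree.HTMLParser()
    # fromstring() parses the document text itself; parse() would treat the
    # same argument as a filename or URL, which is what caused the IOError.
    root = etree.fromstring(html_text, parser)
    cells = root.xpath('//*[@id="main_table_blue_2"]/tr[3]/td[2]')
    return cells[0].text if cells else None


print(extract_country_code(SAMPLE_HTML))
```

With requests, the same function could probably be fed the response body directly, for example extract_country_code(requests.get(url).content). Passing the raw bytes rather than r.text leaves encoding detection to lxml.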


    There are a few alternatives.

    First, as the list quoted above shows, lxml.etree can retrieve a URL for you. Unless you actually need any extra features of requests here, this is a lot simpler. It is also faster, doesn't require reading the entire file into memory, and even lets lxml automatically look up DTDs and other external references. As the docs say:

    Note that it is generally faster to parse from a file path or URL than from an open file object or file-like object. Transparent decompression from gzip compressed sources is supported (unless explicitly disabled in libxml2).
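
A sketch of handing the source straight to parse(), under the same assumptions about the table layout. country_code_from_source is a hypothetical helper. The demo writes the sample markup to a temporary file so it runs offline; passing 'http://countrycode.org/Germany' instead should behave the same way, provided the underlying libxml2 build has HTTP support (plain HTTP and FTP only, as the quoted list says).

```python
import os
import tempfile

from lxml import etree

XPATH = '//*[@id="main_table_blue_2"]/tr[3]/td[2]'


def country_code_from_source(source):
    """source may be a filename, a file object, or an HTTP/FTP URL."""
    tree = etree.parse(source, etree.HTMLParser())
    cells = tree.xpath(XPATH)
    return cells[0].text if cells else None


# Offline demo: a temporary file stands in for the remote page (assumed markup).
sample = (
    '<html><body><table id="main_table_blue_2">'
    "<tr><td>Country</td><td>Germany</td></tr>"
    "<tr><td>ISO codes</td><td>DE / DEU</td></tr>"
    "<tr><td>Country code</td><td>49</td></tr>"
    "</table></body></html>"
)
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False) as handle:
    handle.write(sample)
    path = handle.name

try:
    demo_result = country_code_from_source(path)
finally:
    os.remove(path)  # clean up the temporary file

print(demo_result)
```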

    Or you could use requests' streaming mode to get a file-like object instead of the full contents, as sebastian's answer explains. This is more complicated rather than less, and its speed falls between the other two options, but if you need additional features from requests and you can't afford to hold the entire page in memory, it's the best option.
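
A sketch of the streaming variant, again assuming the table layout from the question. fetch_country_code and country_code_from_fileobj are hypothetical helpers, not taken from the answer it refers to. The demo feeds an in-memory io.BytesIO object in place of response.raw so that it runs without a network connection.

```python
import io

import requests
from lxml import etree

XPATH = '//*[@id="main_table_blue_2"]/tr[3]/td[2]'


def country_code_from_fileobj(fileobj):
    """Parse any object with a read() method, such as response.raw."""
    tree = etree.parse(fileobj, etree.HTMLParser())
    cells = tree.xpath(XPATH)
    return cells[0].text if cells else None


def fetch_country_code(url):
    """Stream the page with requests and let lxml read it incrementally."""
    response = requests.get(url, stream=True)
    response.raise_for_status()
    # Ask urllib3 to undo gzip/deflate transfer encoding while lxml reads.
    response.raw.decode_content = True
    return country_code_from_fileobj(response.raw)


# Offline demo: BytesIO plays the role of the streamed response body.
fake_body = io.BytesIO(
    b'<html><body><table id="main_table_blue_2">'
    b"<tr><td>Country</td><td>Germany</td></tr>"
    b"<tr><td>ISO codes</td><td>DE / DEU</td></tr>"
    b"<tr><td>Country code</td><td>49</td></tr>"
    b"</table></body></html>"
)
streamed_result = country_code_from_fileobj(fake_body)
print(streamed_result)
```

Calling fetch_country_code('http://countrycode.org/Germany') would exercise the real streaming path. It is left out of the demo because it depends on the site still serving that markup.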

    However, for a file as small as this one (46K), there's really no reason to avoid loading it all at once.