python, web-scraping, xpath, python-requests, lxml

Most efficient way to count nodes using XPath in Python


In Python, how could I count the nodes using XPath? For example, using this webpage and this code:

from lxml import html
import requests

url = "http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals"
r = requests.get(url)
tree = html.fromstring(r.content)
count = tree.xpath('count(//*[@id="body"])')
print(count)

It prints 1, but it has 5 div nodes. Can you explain why, and how I can count them correctly?


Solution

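  • It prints 1 (or 1.0, since XPath's count() returns a number, which lxml hands back as a float) because the HTML file you are fetching contains exactly one element with id="body". The expression counts the elements that match the whole path, not the children of the matched element.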

    I downloaded the file and verified this is the case. E.g.:

    $ curl -O http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals
    

    This saves a file named 587-islam-is-dominated-by-radicals.

    $ grep --count 'id="body"' 587-islam-is-dominated-by-radicals
    

    This prints 1. Since grep --count counts matching lines rather than individual occurrences, I also searched the file by hand in vi to be extra sure. Just the one!

    Perhaps you are looking for another div node? One with a different id?

    Update: By the way, XPath and HTML/XML parsing in general are pretty challenging to work with. There is a lot of bad data out there and a lot of complex markup, compounded by the complexity of the retrieval, parsing, and traversal process. You will probably run your tests and trials many times, and they will go a lot faster if you do not "hit the net" for every one of them. Cache the live results. Raw code looks something like this:

    from lxml import html
    import requests

    filepath = "587-islam-is-dominated-by-radicals"
    try:
        # Read bytes so lxml can work out the encoding itself.
        with open(filepath, "rb") as f:
            contents = f.read()
        print("(reading cached copy)")
    except FileNotFoundError:
        url = "http://intelligencesquaredus.org/debates/past-debates/item/587-islam-is-dominated-by-radicals"
        print("(getting file from the net; please stand by)")
        r = requests.get(url)
        contents = r.content
        # Save the response so the next run reads the cached copy.
        with open(filepath, "wb") as f:
            f.write(contents)
    tree = html.fromstring(contents)
    count = tree.xpath('count(//*[@id="body"])')
    print(count)
    

    But you can simplify a lot of that by using a generic caching front-end to requests, such as requests-cache. Happy parsing!
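
As a rough sketch of both points in this answer, the example below counts the div children of the element instead of the element itself, and shows one way the download could go through requests-cache. It assumes the goal was to count the direct div children of the element with id="body". The sample markup, the helper names (count_matches, count_child_divs, fetch_page), and the cache name xpath_demo_cache are all made up for illustration and are not taken from the real page.

```python
from lxml import html

# Made-up markup standing in for the real page: one container, three div children.
SAMPLE = b"""
<html><body>
  <div id="body">
    <div>one</div><div>two</div><div>three</div>
    <p>not a div</p>
  </div>
</body></html>
"""


def count_matches(page_bytes, element_id):
    """Count the elements that carry the given id (what the question's XPath does)."""
    tree = html.fromstring(page_bytes)
    # $eid is an XPath variable; lxml fills it in from the keyword argument.
    return int(tree.xpath('count(//*[@id=$eid])', eid=element_id))


def count_child_divs(page_bytes, element_id):
    """Count the direct div children of the element with the given id."""
    tree = html.fromstring(page_bytes)
    # The extra /div step moves from the container to its div children.
    return int(tree.xpath('count(//*[@id=$eid]/div)', eid=element_id))


def fetch_page(url, session=None):
    """Return the page body as bytes, by default through a requests-cache session."""
    if session is None:
        # Imported lazily so the counting helpers work without requests-cache.
        import requests_cache
        # "xpath_demo_cache" is a hypothetical cache name chosen for this sketch.
        session = requests_cache.CachedSession("xpath_demo_cache")
    return session.get(url).content


print(count_matches(SAMPLE, "body"))     # the container itself
print(count_child_divs(SAMPLE, "body"))  # its div children
```

With the real page, count_child_divs(fetch_page(url), "body") would be the call to try. If the divs you want are nested deeper than direct children, using //div in place of /div in the expression counts all descendant divs instead.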