Tags: python-2.7, xpath, lxml, html

XPath matching the wrong node


The XPath expression

//*[h1]

gives different results in Python and in Firebug. My code:

import requests
from lxml import html

url = "http://machinelearningmastery.com/naive-bayes-classifier-scratch-python/"
resp = requests.get(url)
page = html.fromstring(resp.content)

node = page.xpath("//*[h1]")
print node
#[<Element center at 0x7fb42143c7e0>]

But Firebug matches a <header> tag, which is what I want.

Why is this so? How do I make my Python code match <header> too?


Solution

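  • Your request is missing a User-Agent header, so the server responds with 403 Forbidden instead of the article. lxml then parses the error page, whose <h1> evidently sits inside a <center> element, which is why the XPath matches <center>. Add a User-Agent header to the request and it works as expected: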

    In [9]: resp = requests.get(url, headers={"User-Agent": "Test Agent"})
    
    In [10]: page = html.fromstring(resp.content)
    
    In [11]: node = page.xpath("//*[h1]")
    
    In [12]: print node
    [<Element header at 0x104ff15d0>]
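
To see why the same XPath can match two different elements, here is a minimal offline sketch. Both HTML bodies are hypothetical stand-ins (an nginx-style 403 error page and a heavily simplified article page), not the actual content served by the site, and h1_parent_tags is a helper name invented for this illustration.

```python
from lxml import html

# Hypothetical stand-in for the body of a 403 response (nginx-style error page).
FORBIDDEN_BODY = b"""<html>
<head><title>403 Forbidden</title></head>
<body>
<center><h1>403 Forbidden</h1></center>
</body>
</html>"""

# Hypothetical, heavily simplified stand-in for the real article page.
ARTICLE_BODY = b"""<html>
<body>
<header><h1>Example title</h1></header>
</body>
</html>"""


def h1_parent_tags(body):
    """Return the tag names of all elements that have an <h1> child."""
    page = html.fromstring(body)
    return [el.tag for el in page.xpath("//*[h1]")]


forbidden_tags = h1_parent_tags(FORBIDDEN_BODY)
article_tags = h1_parent_tags(ARTICLE_BODY)
print(forbidden_tags)  # ['center']
print(article_tags)    # ['header']
```

The XPath itself is fine; it is simply being run against a different document. With a live request, checking resp.status_code (or calling resp.raise_for_status()) before parsing resp.content would surface the 403 right away instead of silently parsing the error page.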