
How to extract links from a webpage using lxml, XPath and Python?


I've got this XPath query:

/html/body//tbody/tr[*]/td[*]/a[@title]/@href

In Firefox's XPath Checker add-on it extracts all the links that have a title attribute and returns their href values.

However, I cannot seem to use it with lxml.

from lxml import etree

parsedPage = etree.HTML(page)  # Build a parse tree from the page source

# XPath query
hyperlinks = parsedPage.xpath("/html/body//tbody/tr[*]/td[*]/a[@title]/@href")
for x in hyperlinks:
    print(x)  # Print the href of each <a> tag that has a title attribute

This produces no result from lxml (empty list).

How would one grab the href value (link) of every hyperlink that has a title attribute, using lxml in Python?
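One way to see why the query comes back empty is to serialize what lxml actually built and try the XPath step by step. This is a minimal debugging sketch with a made-up page string; note that libxml2's HTML parser does not insert the `<tbody>` elements that Firefox adds to its DOM:

```python
from lxml import etree

page = ('<html><body><table><tr><td>'
        '<a href="/x" title="X">X</a>'
        '</td></tr></table></body></html>')

tree = etree.HTML(page)

# Serialize the parsed tree: there is no <tbody> in it, so any XPath
# step that goes through tbody can never match.
print(etree.tostring(tree, pretty_print=True).decode())

print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))  # []
print(tree.xpath('//tr/td/a[@title]/@href'))                  # ['/x']
```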


Solution

  • The query most likely fails because Firefox inserts <tbody> elements into its DOM even when they are absent from the source HTML, while lxml parses only what is actually there. With <tbody> genuinely present in the markup, I was able to make it work with the following code:

    from io import StringIO
    from lxml import etree
    
    html_string = '''<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
       "http://www.w3.org/TR/html4/loose.dtd">
    
    <html lang="en">
    <head/>
    <body>
        <table border="1">
          <tbody>
            <tr>
              <td><a href="http://stackoverflow.com/foobar" title="Foobar">A link</a></td>
            </tr>
            <tr>
              <td><a href="http://stackoverflow.com/baz" title="Baz">Another link</a></td>
            </tr>
          </tbody>
        </table>
    </body>
    </html>'''
    
    tree = etree.parse(StringIO(html_string))
    print(tree.xpath('/html/body//tbody/tr/td/a[@title]/@href'))
    
    # Output: ['http://stackoverflow.com/foobar', 'http://stackoverflow.com/baz']
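If the exact table structure is uncertain, a relative XPath that keys only on the title attribute sidesteps the tbody question entirely. A minimal sketch using lxml's HTML parser, with made-up example URLs:

```python
from lxml import html

page = '''<html><body>
<table>
  <tr><td><a href="http://example.com/a" title="A">A link</a></td></tr>
  <tr><td><a href="http://example.com/b">no title attribute</a></td></tr>
</table>
</body></html>'''

tree = html.fromstring(page)

# Match every <a> anywhere in the document that carries a title
# attribute, regardless of whether a <tbody> exists in the source.
hrefs = tree.xpath('//a[@title]/@href')
print(hrefs)  # ['http://example.com/a']
```

The trade-off is precision: `//a[@title]` will also pick up titled links outside tables, so narrow the path only as much as the source markup reliably supports.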