I'm writing a tool that needs to collect all URLs within a div on a web page, but no URLs outside that div. Simplified, the page looks something like this:
<div id="bar">
  <a href="link I don't want">
  <div id="foo">
    <lots of html>
    <h1 class="baz">
      <a href="link I want">
    </h1>
    <h1 class="caz">
      <a href="link I want">
    </h1>
  </div>
</div>
When selecting the div with Firebug and choosing Copy XPath, I get //*[@id="foo"]. So far so good. However, I'm stuck trying to find all the URLs inside the foo div. Please help me find a way to extract the URLs defined by the href attribute of the <a> elements.
Example code similar to what I'm working on, using w3schools:
import mechanize
import lxml.html
import cookielib
br = mechanize.Browser()
# Cookie jar so the session persists across requests
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
# Browser options
br.set_handle_equiv(True)
br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'WatcherBot')]
# Fetch the page and parse it with lxml
r = br.open('http://w3schools.com/')
html = br.response().read()
root = lxml.html.fromstring(html)
# This selects the div itself, but not the links inside it
hrefs = root.xpath('//*[@id="leftcolumn"]')
# Found no solution yet. Stuck
Thank you for your time!
You probably want this:
hrefs = root.xpath('//div[@id="foo"]//a/@href')
This will give you a list of all href values from a tags inside <div id="foo">, at any level of nesting.
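For completeness, here's a minimal sketch of how that line slots into your w3schools example (assuming the leftcolumn div plays the role of foo; the URL and id are just the placeholders from your snippet):

import mechanize
import lxml.html

br = mechanize.Browser()
br.set_handle_robots(False)
br.addheaders = [('User-agent', 'WatcherBot')]

html = br.open('http://w3schools.com/').read()
root = lxml.html.fromstring(html)

# All href values of <a> elements anywhere under the target div
hrefs = root.xpath('//div[@id="leftcolumn"]//a/@href')

# Optional: resolve relative hrefs against the page URL first
root.make_links_absolute('http://w3schools.com/')
abs_hrefs = root.xpath('//div[@id="leftcolumn"]//a/@href')

print(hrefs)

If you prefer CSS selectors and have the cssselect package installed, root.cssselect('div#foo a') matches the same elements; you'd then call .get('href') on each one.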