I'm creating a Python scraper at scraperwiki.com. I need to parse part of an HTML page that contains the following code:
<div class="div_class">
<h3>I'm a title. Don't touch me</h3>
<ul>
<li>
I'm a title. Parse me
<ul>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
</ul>
</li>
<li>
I'm a title. Parse me
<ul>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
</ul>
</li>
<li>
I'm a title. Parse me
<ul>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
</ul>
</li>
<li>
I'm a title. Parse me
<ul>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
<li>fdfdsfd</li>
</ul>
</li>
</ul>
</div>
I want to parse only "I'm a title. Parse me" titles. Here is how I'm doing it:
import scraperwiki
import lxml.html
import re
import datetime
#.......................
raw_string = lxml.html.fromstring(scraperwiki.scrape(url_to_scrape))
raw_html = raw_string.cssselect("div.div_class ul > li")
for item in raw_html:
    print(item.text_content())
It does work, but it captures all the data inside each ul. I don't want that; I want to find only "I'm a title. Parse me" in each ul and nothing else.
How can I do it?
The beauty of lxml is that you can use both CSS selectors and XPath to find any element on the page.
In your case, since you have nested <ul> lists, it's better to use XPath for navigation and to read each item's own .text instead of text_content():
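# find every <li> that is a direct child of the <ul> directly under the div with class div_class
raw_html = raw_string.xpath("//div[@class='div_class']/ul/li")
for item in raw_html:
    # .text is only the text before the first child element, so the nested <ul> is skipped
    print(item.text.strip())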
prints:
I'm a title. Parse me
I'm a title. Parse me
I'm a title. Parse me
I'm a title. Parse me
Here is a brief explanation of XPath in lxml: http://lxml.de/tutorial.html#using-xpath-to-find-text
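If you want to try this without hitting a live page, here is a minimal self-contained sketch. The sample HTML, the titles in it, and the helper name own_titles are made up for illustration; only lxml.html.fromstring, xpath, .text and text_content() come from lxml itself. It shows why .text returns just the title while text_content() also pulls in the nested list items:

```python
import lxml.html

# Hypothetical sample that mirrors the structure from the question
html = """
<div class="div_class">
<h3>I'm a title. Don't touch me</h3>
<ul>
<li>
First title
<ul>
<li>a</li>
<li>b</li>
</ul>
</li>
<li>
Second title
<ul>
<li>c</li>
</ul>
</li>
</ul>
</div>
"""

root = lxml.html.fromstring(html)


def own_titles(root):
    """Return the leading text of each top-level <li>, ignoring nested lists."""
    titles = []
    # /ul/li matches only direct children, so the inner <li> items are not selected
    for li in root.xpath("//div[@class='div_class']/ul/li"):
        # .text can be None when an element starts with a child tag
        titles.append((li.text or "").strip())
    return titles


titles = own_titles(root)
print(titles)

# For comparison: text_content() concatenates the text of all descendants
first_li = root.xpath("//div[@class='div_class']/ul/li")[0]
full_text = first_li.text_content()
```

The `or ""` guard is just a precaution: in the markup from the question every outer <li> starts with text, but .text is None for an element whose first node is a child tag.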