I need to extract all the text from a <p>.
The paragraph is full of links, so it is easy to extract the linked text with this expression:
//div[@class="content clearfix"]/p[2]//a/text()
The problem is that sometimes I also need to extract text from the same paragraph that is not linked, e.g.:
<p>
<a href="url">text1</a>,
text2,
<a href="url">text3</a>,
<a href="url">text4</a>,
<a href="url">text5</a>,
<a href="url">text6</a>,
<a href="url">text7</a>,
text8,
text9
</p>
Using the preceding expression I can’t get text2, text8 and text9.
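For example, against the markup above that expression only returns the linked pieces (a rough sketch, assuming lxml; tree is just a placeholder name for the parsed document):
tree.xpath('//div[@class="content clearfix"]/p[2]//a/text()')
# roughly: ['text1', 'text3', 'text4', 'text5', 'text6', 'text7']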
If I extract the text this way:
//div[@class="content clearfix"]/p[2]//text()
I get a mess, because of the unwanted presence of commas, spaces and other characters.
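Roughly, the raw text nodes come back like this (again just a sketch with lxml; the exact whitespace depends on the source formatting):
tree.xpath('//div[@class="content clearfix"]/p[2]//text()')
# roughly: ['\n', 'text1', ',\ntext2,\n', 'text3', ',\n', 'text4', ',\n',
#           'text5', ',\n', 'text6', ',\n', 'text7', ',\ntext8,\ntext9\n']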
Is there any way to do this with XPath?
UPDATE: my desired output is a list like the following:
["text1", "text2", "text3", "text4", "text5", "text6", "text7", "text8", "text9"]
Try using normalize-space():
normalize-space(//div[@class="content clearfix"]/p[2])
This will get you close. The result is a single string that looks something like this:
text1, text2, text3, text4, text5, text6, text7, text8, text9
Then you can split it on "," (text is a variable containing the string above):
split_text = [text_node.strip() for text_node in text.split(",")]
Full example...
from lxml import etree
xml = """
<doc>
<div class="content clearfix">
<p>Added to make p[2] in xpath work.</p>
<p>
<a href="url">text1</a>,
text2,
<a href="url">text3</a>,
<a href="url">text4</a>,
<a href="url">text5</a>,
<a href="url">text6</a>,
<a href="url">text7</a>,
text8,
text9
</p>
</div>
</doc>
"""
# normalize-space() collapses all whitespace and returns a single string,
# not a list of nodes.
parsed_xml = etree.fromstring(xml)
text = parsed_xml.xpath('normalize-space(//div[@class="content clearfix"]/p[2])')
print(text)
# Split on the separator commas and strip the leftover whitespace.
split_text = [text_node.strip() for text_node in text.split(",")]
print(split_text)
printed output...
text1, text2, text3, text4, text5, text6, text7, text8, text9
['text1', 'text2', 'text3', 'text4', 'text5', 'text6', 'text7', 'text8', 'text9']
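If you prefer to work from the individual text nodes instead of the normalize-space() string, a rough equivalent (using the same parsed_xml as above) is to split and strip each node yourself:
# Sketch of an alternative: clean each raw text() node. Nodes may carry
# leading/trailing commas and whitespace, and unlinked runs such as
# ",\ntext8,\ntext9\n" hold more than one item, so split on "," first.
nodes = parsed_xml.xpath('//div[@class="content clearfix"]/p[2]//text()')
split_text = [piece.strip() for node in nodes for piece in node.split(",")]
split_text = [piece for piece in split_text if piece]
print(split_text)
# ['text1', 'text2', 'text3', 'text4', 'text5', 'text6', 'text7', 'text8', 'text9']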