Extract text not surrounded by hrefs

I must extract all the text from a <p>.

This paragraph is full of links, so extracting the linked text is easy with this expression:

//div[@class="content clearfix"]/p[2]//a/text()

The problem is that sometimes I also need to extract text from the same paragraph that is not linked, e.g.:

<p>
<a href="url">text1</a>,
text2,
<a href="url">text3</a>,
<a href="url">text4</a>,
<a href="url">text5</a>,
<a href="url">text6</a>,
<a href="url">text7</a>,
text8,
text9
</p>

Using the preceding expression I can’t get text2, text8 and text9.

If I extract the text this way:

//div[@class="content clearfix"]/p[2]//text()

I get a mess, because of the unwanted presence of commas, spaces and other characters.

Is there any way to do this with XPath?

UPDATE: my desired output is a list like the following:

["text1", "text2", "text3", "text4", "text5", "text6", "text7", "text8", "text9"]

Solution

Try using normalize-space():

normalize-space(//div[@class="content clearfix"]/p[2])

This will get you close. It would be a string that looks something like this:

text1, text2, text3, text4, text5, text6, text7, text8, text9

Then you could split it up by "," (text is a variable containing the string above):

split_text = [text_node.strip() for text_node in text.split(",")]
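One caveat with the split above: if the paragraph ends with a trailing comma, or has two commas in a row, the result will contain empty strings. A minimal sketch of a more defensive split follows; the helper name split_items is hypothetical and not part of lxml or the original answer.

```python
def split_items(text):
    """Split a comma-separated string into stripped, non-empty items."""
    # Strip each piece and drop pieces that are empty after stripping.
    return [item.strip() for item in text.split(",") if item.strip()]


# Sample string with a doubled comma and a trailing comma.
print(split_items("text1, text2, , text3,"))
```

This still assumes that none of the items contains a comma of its own.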

Full example...

from lxml import etree

xml = """
<doc>
<div class="content clearfix">
<p>Added to make p[2] in xpath work.</p>
<p>
<a href="url">text1</a>,
text2,
<a href="url">text3</a>,
<a href="url">text4</a>,
<a href="url">text5</a>,
<a href="url">text6</a>,
<a href="url">text7</a>,
text8,
text9
</p>
</div>
</doc>
"""

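# Parse the sample markup into an element tree.
parsed_xml = etree.fromstring(xml)

# normalize-space() returns the paragraph's text as a single string,
# with leading/trailing whitespace removed and inner runs collapsed.
text = parsed_xml.xpath('normalize-space(//div[@class="content clearfix"]/p[2])')

print(text)

# Split on commas and strip the leftover spaces around each item.
split_text = [text_node.strip() for text_node in text.split(",")]

print(split_text)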

printed output...

text1, text2, text3, text4, text5, text6, text7, text8, text9
['text1', 'text2', 'text3', 'text4', 'text5', 'text6', 'text7', 'text8', 'text9']
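If a link's text could itself contain a comma, splitting the whole normalized string would break that item apart. A possible alternative, sketched here under that assumption, is to walk the individual text nodes and split only the ones that are not the direct text of an <a>. It relies on lxml's "smart strings" (the is_text attribute and getparent() method on XPath string results, which are enabled by default); the function name extract_items and the sample markup are made up for illustration.

```python
from lxml import etree

# Hypothetical sample: the first link's text contains a comma.
html = """
<div class="content clearfix">
<p>Added to make p[2] in xpath work.</p>
<p>
<a href="url">text, one</a>,
text2,
<a href="url">text3</a>,
text8,
text9
</p>
</div>
"""


def extract_items(root):
    """Collect link texts whole and split the unlinked text on commas."""
    items = []
    for node in root.xpath('//div[@class="content clearfix"]/p[2]//text()'):
        if node.is_text and node.getparent().tag == "a":
            # Direct text of a link: keep it as one item.
            pieces = [node]
        else:
            # Text between links: may hold several comma-separated items.
            pieces = node.split(",")
        # Strip whitespace and skip pieces that are empty afterwards.
        items.extend(piece.strip() for piece in pieces if piece.strip())
    return items


root = etree.fromstring(html)
print(extract_items(root))
```

For the markup in the question, where no item contains a comma, this should produce the same list as the normalize-space() approach.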