python xpath lxml href

Extract text not surrounded by hrefs


I need to extract all the text from a <p> element.

The paragraph consists mostly of links, so extracting the linked text is easy with this expression:

//div[@class="content clearfix"]/p[2]//a/text()

The problem is that, from the same paragraph, I sometimes also need to extract text that is not linked, e.g.:

<p>
<a href="url">text1</a>,
text2,
<a href="url">text3</a>,
<a href="url">text4</a>,
<a href="url">text5</a>,
<a href="url">text6</a>,
<a href="url">text7</a>,
text8,
text9
</p>

Using the preceding expression I can’t get text2, text8 and text9.

If I extract the text this way:

//div[@class="content clearfix"]/p[2]//text()

I get a mess, because of the unwanted presence of commas, spaces and other characters.
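
To make the difference concrete, the sketch below runs both expressions with lxml. The wrapper document, the shortened paragraph and the variable names are assumptions for illustration, not the real page.

```python
from lxml import etree

# Hypothetical stand-in for the real page; only the second <p> mirrors the markup above.
html = """
<doc>
<div class="content clearfix">
<p>first paragraph</p>
<p>
<a href="url">text1</a>,
text2,
<a href="url">text3</a>,
text8,
text9
</p>
</div>
</doc>
"""

root = etree.fromstring(html)

# Only the text inside <a> elements: the unlinked items are missing.
link_texts = root.xpath('//div[@class="content clearfix"]/p[2]//a/text()')

# Every text node under the paragraph: the unlinked items arrive
# glued to commas and newlines, and some nodes hold only separators.
all_texts = root.xpath('//div[@class="content clearfix"]/p[2]//text()')

print(link_texts)
print([str(t) for t in all_texts])
```

The second list contains everything, but the unlinked items are not separate, clean entries.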

Is there any way to do this with XPath?

UPDATE: my desired output is a list like the following:

["text1", "text2", "text3", "text4", "text5", "text6", "text7", "text8", "text9"]

Solution

  • Try using normalize-space():

    normalize-space(//div[@class="content clearfix"]/p[2])
    

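    This will get you close. normalize-space() takes the string value of the paragraph, trims it, and collapses each run of whitespace into a single space, so you get a string that looks something like this: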

    text1, text2, text3, text4, text5, text6, text7, text8, text9
    

    Then you could split it up by "," (text is a variable containing the string above):

    split_text = [text_node.strip() for text_node in text.split(",")]
    

    Full example...

    from lxml import etree
    
    xml = """
    <doc>
    <div class="content clearfix">
    <p>Added to make p[2] in xpath work.</p>
    <p>
    <a href="url">text1</a>,
    text2,
    <a href="url">text3</a>,
    <a href="url">text4</a>,
    <a href="url">text5</a>,
    <a href="url">text6</a>,
    <a href="url">text7</a>,
    text8,
    text9
    </p>
    </div>
    </doc>
    """
    
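    parsed_xml = etree.fromstring(xml)
    
    # normalize-space() returns one string with the whitespace collapsed
    text = parsed_xml.xpath('normalize-space(//div[@class="content clearfix"]/p[2])')
    
    print(text)
    
    # split on the commas and trim the space left around each item
    split_text = [text_node.strip() for text_node in text.split(",")]
    
    print(split_text)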
    

    printed output...

    text1, text2, text3, text4, text5, text6, text7, text8, text9
    ['text1', 'text2', 'text3', 'text4', 'text5', 'text6', 'text7', 'text8', 'text9']
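
One caveat with splitting the whole string on ",": a link whose own text contains a comma would be cut in two. A possible variation, sketched below under the assumption that the links are direct children of the <p>, walks the paragraph's children, keeps each link's text whole, and splits only the loose text between links. The helper name extract_items is hypothetical, not part of lxml.

```python
from lxml import etree

def extract_items(p):
    """Return link texts and comma-separated loose text from a <p>, in document order."""
    items = []

    def add_loose(text):
        # Loose text (p.text or an element's tail) may hold several
        # comma-separated items, or nothing but separators.
        if text:
            items.extend(part.strip() for part in text.split(",") if part.strip())

    add_loose(p.text)
    for child in p:
        # Keep the link text whole, even if it contains a comma.
        link_text = "".join(child.itertext()).strip()
        if link_text:
            items.append(link_text)
        add_loose(child.tail)
    return items

xml = """
<doc>
<div class="content clearfix">
<p>Added to make p[2] in xpath work.</p>
<p>
<a href="url">text1</a>,
text2,
<a href="url">text3</a>,
<a href="url">text4</a>,
<a href="url">text5</a>,
<a href="url">text6</a>,
<a href="url">text7</a>,
text8,
text9
</p>
</div>
</doc>
"""

parsed_xml = etree.fromstring(xml)
p = parsed_xml.xpath('//div[@class="content clearfix"]/p[2]')[0]
items = extract_items(p)
print(items)
```

For the sample markup this gives the same nine items as the normalize-space() approach; the two only differ when a link's text itself contains a comma.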