Parsing XML document that includes another XML document embedded in a CDATA section

I'm trying out web scraping for the first time using lxml.etree. The website I want to scrape has an XML feed, which I can read fine, except for a part of its XML which is embedded within a CDATA section:

from lxml import etree

parser = etree.XMLParser(recover=True)

data=b'''<?xml version="1.0" encoding="UTF-8"?>
    <summary type="xhtml"><![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
                    <eventType>Other unavailability</eventType>
                <unavailabilityReason>Yearly maintenance</unavailabilityReason>
                <remarks>Uncertain duration</remarks>
                    <ns2:name>Gassco AS</ns2:name>

tree = etree.fromstring(data)
block = tree.xpath("/feed/entry/summary")[0]

block_str = "b'''"+block.text+"'''"

tree_in_tree = etree.fromstring(block_str)

The problem is that the XML code in the CDATA section is weirdly indented, meaning that if I just pass the CDATA content into a string and then read it with etree (like I do above), I get an error message that I assume is caused by the indentation.

This is the message:

XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1

As far as I understand, the problem is the indentation between the first line of the CDATA section and REMITUrgentMarketMessages.

Does anyone know how to fix this? :)

Thanks for the help!


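  • The b prefix is only meaningful on bytes literals in source code, and block.text is not a literal. Concatenating "b'''" onto it just produces a str whose first character is the letter b, which is why the parser reports that '<' was not found at line 1, column 1; the indentation is not the cause. Instead, create the bytes object (representing the embedded XML document) using bytes():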

    block_str = bytes(block.text, "UTF-8")

    Now when the program is run, you will get the following error:

    lxml.etree.XMLSyntaxError: Namespace prefix ns2 on name is not defined

    The embedded document uses the ns2 prefix without declaring it, which is a genuine namespace error, but it can be bypassed by passing the parser you already configured with recover=True:

    tree_in_tree = etree.fromstring(block_str, parser)
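
Putting the two steps together, here is a minimal sketch of the whole flow. The feed below is a shortened, hypothetical stand-in for the real one: the message root element and the exact layout are assumptions made only so the example is complete, while the element names inside it come from the question. The call to strip() is an extra precaution, since an XML declaration must be the very first thing in a document.

```python
from lxml import etree

# Hypothetical, shortened stand-in for the real feed (the <message> root
# of the embedded document is an assumption made for illustration).
outer = b'''<?xml version="1.0" encoding="UTF-8"?>
<feed>
  <entry>
    <summary type="xhtml"><![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<message>
    <eventType>Other unavailability</eventType>
    <ns2:name>Gassco AS</ns2:name>
</message>]]></summary>
  </entry>
</feed>'''

# recover=True lets the parser carry on past the undeclared ns2 prefix.
parser = etree.XMLParser(recover=True)

# Parse the outer document and pick out the element holding the CDATA.
tree = etree.fromstring(outer)
block = tree.xpath("/feed/entry/summary")[0]

# block.text is a str. Encode it to bytes, because lxml does not accept
# str input that carries an encoding declaration. strip() removes any
# whitespace that might precede the XML declaration.
inner_bytes = block.text.strip().encode("UTF-8")

# Parse the embedded document with the recovering parser.
inner = etree.fromstring(inner_bytes, parser)
event_type = inner.findtext("eventType")
print(event_type)
```

Here block.text.encode("UTF-8") does the same job as bytes(block.text, "UTF-8"); which one to use is a matter of taste.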