I'm trying out web scraping for the first time using lxml.etree. The website I want to scrape has an XML feed, which I can read fine, except for a part of its XML which is embedded within a CDATA section:
from lxml import etree
parser = etree.XMLParser(recover=True)
data=b'''<?xml version="1.0" encoding="UTF-8"?>
<feed>
<entry>
<summary type="xhtml"><![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<REMITUrgentMarketMessages>
<UMM>
<messageId>2023-86___________________001</messageId>
<event>
<eventStatus>Active</eventStatus>
<eventType>Other unavailability</eventType>
<eventStart>2023-09-07T06:00:00.000+02:00</eventStart>
<eventStop>2023-09-10T06:00:00.000+02:00</eventStop>
</event>
<unavailabilityType>Planned</unavailabilityType>
<publicationDateTime>2022-10-06T13:42:00.000+02:00</publicationDateTime>
<capacity>
<unitMeasure>mcm/d</unitMeasure>
<unavailableCapacity>9.0</unavailableCapacity>
<availableCapacity>0.0</availableCapacity>
<technicalCapacity>9.0</technicalCapacity>
</capacity>
<unavailabilityReason>Yearly maintenance</unavailabilityReason>
<remarks>Uncertain duration</remarks>
<balancingZone>21Y000000000024I</balancingZone>
<balancingZone>21Y0000000001278</balancingZone>
<balancingZone>21YGB-UKGASGRIDW</balancingZone>
<balancingZone>21YNL----TTF---1</balancingZone>
<balancingZone>37Y701125MH0000I</balancingZone>
<balancingZone>37Y701133MH0000P</balancingZone>
<affectedAsset>
<ns2:name>Dvalin</ns2:name>
</affectedAsset>
<marketParticipant>
<ns2:name>Gassco AS</ns2:name>
<ns2:eic>21X-NO-A-A0A0A-2</ns2:eic>
</marketParticipant>
</UMM>
</REMITUrgentMarketMessages>]]></summary>
</entry>
</feed>
'''
tree = etree.fromstring(data)
block = tree.xpath("/feed/entry/summary")[0]
block_str = "b'''"+block.text+"'''"
tree_in_tree = etree.fromstring(block_str)
The problem is that the XML in the CDATA section is weirdly indented, so if I just put the CDATA content into a string and then parse it with etree (as in the code above), I get an error message that I assume is caused by the indentation.
This is the message:
XMLSyntaxError: Start tag expected, '<' not found, line 1, column 1
As far as I understand, the indentation between the first line of the CDATA section and REMITUrgentMarketMessages is wrong.
Does anyone know how to fix this? :)
Thanks for the help!
The b prefix is used for bytes literals, but block.text is not a literal. Instead, create the bytes object (representing the embedded XML document) using bytes():
block_str = bytes(block.text, "UTF-8")
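To illustrate why the original concatenation fails, here is a small sketch using a placeholder string (the name text and its value are assumptions for illustration, not taken from the feed). Concatenating "b'''" onto a string just produces a longer string whose first character is the letter b, which would explain why the parser complains that '<' was not found at line 1, column 1.

```python
# Placeholder standing in for block.text (hypothetical value).
text = "<root/>"

# The question's approach: still a str, and it now starts with the letter "b".
wrapped = "b'''" + text + "'''"

# The suggested approach: a real bytes object holding the encoded document.
encoded = bytes(text, "UTF-8")

print(type(wrapped).__name__, repr(wrapped[0]))
print(type(encoded).__name__, repr(encoded[:1]))
```

text.encode("UTF-8") is an equivalent way to spell the same conversion.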
Now when the program is run, you will get the following error:
lxml.etree.XMLSyntaxError: Namespace prefix ns2 on name is not defined
That is a serious error, but it can be bypassed by using the parser configured with recover=True:
tree_in_tree = etree.fromstring(block_str, parser)
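Putting both changes together, the whole flow might look like the sketch below. It uses a shortened copy of the feed from the question; the variable names (feed, outer, summary, inner_bytes, lenient, inner) are my own choices, and it only reads an element that has no ns2 prefix, since how a recovered element with an undefined prefix is exposed is not something I would rely on.

```python
from lxml import etree

# Shortened version of the feed from the question.
feed = b'''<?xml version="1.0" encoding="UTF-8"?>
<feed>
<entry>
<summary type="xhtml"><![CDATA[<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<REMITUrgentMarketMessages>
<UMM>
<messageId>2023-86___________________001</messageId>
<affectedAsset>
<ns2:name>Dvalin</ns2:name>
</affectedAsset>
</UMM>
</REMITUrgentMarketMessages>]]></summary>
</entry>
</feed>
'''

# Parse the outer feed; the CDATA content is available as plain text.
outer = etree.fromstring(feed)
summary = outer.xpath("/feed/entry/summary")[0]

# Encode the embedded document to bytes before parsing it.
inner_bytes = summary.text.encode("UTF-8")

# recover=True lets the parser continue past the undefined ns2 prefix.
lenient = etree.XMLParser(recover=True)
inner = etree.fromstring(inner_bytes, lenient)

message_id = inner.findtext("UMM/messageId")
print(inner.tag, message_id)
```

Keep in mind that recover=True can also hide other problems in the document, so it may be worth checking the parser's error_log if the results look incomplete.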