Tags: python, xml, parsing, lxml, cdata

Find and Replace CDATA Attribute Values in XML - Python


I am trying to find and replace XML values, similar to a related question (Find and Replace XML Attributes by Indexing - Python), but for content contained within a CDATA string. Specifically, I would like to know whether the values inside the CDATA section (strictly, the text of the 'td' elements rather than attributes) can be found and replaced by index. I want to replace the first and second values in the first set of 'td' subelements, and the second and third values in the second set. Below are the XML, the script I am using, and the desired output XML containing the new values:

The XML ("foo_bar_CDATA.xml"):

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<Overlay>
    <description>
    <![CDATA[
    <html>
    <head>
        <body>
            <div id="view">
                <div class="item">
                    <tr id="source">
                        <td class="raster">Source</td>
                        <td class="number">1800</td>
                        <td class="number">2100</td>
                    </tr>
                    <tr id="preview">
                        <td class="raster">Preview</td>
                        <td class="number">1100</td>
                        <td class="number">1500</td>
                    </tr>
                </div>
            </div>
        </body>
    </head>
    </html>
    ]]>
    </description>   
</Overlay></kml>

The script:

import lxml.etree as ET

# File name matches the XML shown above
xml = ET.parse("C:\\Users\\mdl518\\Desktop\\foo_bar_CDATA.xml")
tree = xml.getroot()

# New values, kept as strings because str.replace() only accepts str arguments
val_1 = "1900"
val_2 = "2000"
val_3 = "3000"
val_4 = "4000"

# Find and replace the "td" text values with the new values (val_"x")
for elem in tree.iter():
    if elem.text:
        elem.text = (elem.text.replace('Source', val_1)
                              .replace('1800', val_2)
                              .replace('1100', val_3)
                              .replace('1500', val_4))
        print(elem.text)

# Serialize once, after all replacements have been made
output = ET.tostring(tree,
                     encoding="UTF-8",
                     method="xml",
                     xml_declaration=True,
                     pretty_print=True)

print(output.decode("utf-8"))

The Desired Output XML:

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2" xmlns:gx="http://www.google.com/kml/ext/2.2">
<Overlay>
    <description>
    <![CDATA[
    <html>
    <head>
        <body>
            <div id="view">
                <div class="item">
                    <tr id="source">
                        <td class="raster">1900</td>
                        <td class="number">2000</td>
                        <td class="number">2100</td>
                    </tr>
                    <tr id="preview">
                        <td class="raster">Preview</td>
                        <td class="number">3000</td>
                        <td class="number">4000</td>
                    </tr>
                </div>
            </div>
        </body>
    </head>
    </html>
    ]]>
    </description>   
</Overlay></kml>

My main issue is indexing/reading the values correctly rather than hard-coding the strings to search for; finding and replacing them by index would be ideal. The above approach appears viable for XML files without CDATA strings, but I cannot determine how to parse the CDATA content correctly, or how to write the XML back to a file properly. Additionally, the opening and closing angle brackets (<, >) are being incorrectly written as &lt; and &gt; within the XML. Any assistance is most appreciated!


Solution

  • Since the CDATA is an HTML string (and, in this case, also well-formed XML), I would extract it out of the XML, make changes to it and then reinsert it in the XML:

    from lxml import etree

    doc = etree.parse("foo_bar_CDATA.xml")  # adjust the path as needed
    desc = doc.xpath('//*[local-name()="description"]')[0]

    # first edit: parse the CDATA text out of the XML as its own tree
    cd = etree.fromstring(desc.text)

    vals = ["1900","2000","3000","4000"]
    rems = ["Source","1800","1100","1500"]
    targets = cd.xpath('//tr//td')
    for target in targets:
        if target.text in rems:
            target.text = vals[rems.index(target.text)]

    # second edit: serialize the HTML and put it back into the XML as CDATA
    desc.text = etree.CDATA(etree.tostring(cd, encoding="unicode"))

    print(etree.tostring(doc).decode())
    

    The output should be your expected output.
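
The question also asks about replacing values by position rather than by matching the old text. A minimal sketch of that variant follows, under a few assumptions: the inner HTML is well-formed enough for `etree.fromstring`, the rows can be selected by their `id` attribute, and the XML is trimmed and inlined here only to keep the example self-contained. The names `NEW_VALUES` and `replace_cells_by_index` are hypothetical and not part of lxml.

```python
from lxml import etree

# Trimmed, inline copy of the question's document (assumption: same structure).
KML = b"""<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
<Overlay>
    <description><![CDATA[
    <html><body>
        <tr id="source">
            <td class="raster">Source</td>
            <td class="number">1800</td>
            <td class="number">2100</td>
        </tr>
        <tr id="preview">
            <td class="raster">Preview</td>
            <td class="number">1100</td>
            <td class="number">1500</td>
        </tr>
    </body></html>
    ]]></description>
</Overlay></kml>"""

# Hypothetical mapping: row id -> {0-based cell index: new text}.
NEW_VALUES = {
    "source": {0: "1900", 1: "2000"},
    "preview": {1: "3000", 2: "4000"},
}


def replace_cells_by_index(kml_bytes, new_values):
    """Replace <td> text by position inside the CDATA-wrapped HTML."""
    doc = etree.fromstring(kml_bytes)
    description = doc.xpath('//*[local-name()="description"]')[0]

    # The default parser exposes the CDATA content as plain text,
    # so it can be parsed again as a separate tree.
    html = etree.fromstring(description.text)

    for row_id, cells in new_values.items():
        tds = html.xpath('//tr[@id=$rid]/td', rid=row_id)
        for index, text in cells.items():
            tds[index].text = text

    # Wrapping in CDATA keeps the markup from being escaped on output.
    description.text = etree.CDATA(etree.tostring(html, encoding="unicode"))
    return etree.tostring(doc, xml_declaration=True,
                          encoding="UTF-8").decode("utf-8")


result = replace_cells_by_index(KML, NEW_VALUES)
print(result)
```

Wrapping the serialized HTML in `etree.CDATA` before assigning it to `.text` is what keeps `<` and `>` from being written as `&lt;` and `&gt;`; assigning a plain string instead would produce the escaped output described in the question. To save the result, the returned string can be written to a file with an explicit UTF-8 encoding.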