Tags: python, xml, beautifulsoup, lxml, cdata

How would one remove the CDATA tags but preserve the actual data in Python using lxml or BeautifulSoup?


I am parsing some XML with BeautifulSoup. I pull the CDATA out with the following code, but I only want the data and not the CDATA tags.

    myXML = open(r"c:\myfile.xml", "r")  # raw string so the backslash is not read as an escape
    soup = BeautifulSoup(myXML)
    data = soup.find(text=re.compile("CDATA"))

    print(data)

    <![CDATA[TEST DATA]]>

What I would like to see is the following output:

TEST DATA

I don't care whether the solution uses lxml or BeautifulSoup; I just want the best or easiest way to get the job done. Thanks!


Here is a solution:

    from lxml import etree

    # strip_cdata=False keeps CDATA sections in the parsed tree
    parser = etree.XMLParser(strip_cdata=False)
    root = etree.parse(self.param1, parser)
    data = root.findall('./config/script')
    for item in data:  # each item.text is the CDATA content without the markers
        print(item.text)
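
The snippet above depends on `self.param1`, which is not shown. As a rough, self-contained sketch of the same idea in Python 3: the helper name `script_texts`, the temporary file, and the sample document are assumptions made up for illustration; only the `./config/script` path and `strip_cdata=False` come from the original.

```python
import os
import tempfile

from lxml import etree


def script_texts(path):
    """Return the text of every ./config/script element in the file at `path`."""
    # strip_cdata=False keeps CDATA sections in the tree; .text still returns
    # only the character data, without the <![CDATA[ ... ]]> markers.
    parser = etree.XMLParser(strip_cdata=False)
    tree = etree.parse(path, parser)
    return [item.text for item in tree.findall('./config/script')]


# Hypothetical sample document, written to a temporary file for the demo.
SAMPLE = b"<settings><config><script><![CDATA[TEST DATA]]></script></config></settings>"

with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as handle:
    handle.write(SAMPLE)
    sample_path = handle.name

try:
    texts = script_texts(sample_path)
finally:
    os.remove(sample_path)  # clean up the temporary file

print(texts)
```

With this sample, `texts` should hold the single string `TEST DATA`, with no CDATA markers around it.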

Solution

  • Based on the lxml docs:

    >>> from lxml import etree
    >>> parser = etree.XMLParser(strip_cdata=False)
    >>> root = etree.XML('<root><data><![CDATA[test]]></data></root>', parser)
    >>> data = root.findall('data')
    >>> for item in data:  # iterate through list to find text contained in elements containing CDATA
    ...     print(item.text)
    ...
    test  # just the text of <![CDATA[test]]>
    

    This is probably the simplest approach, provided your XML structure lets you locate the elements that contain the CDATA.
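
One detail that may help when choosing parser options: in lxml, `.text` returns the character data without the CDATA markers whether or not `strip_cdata=False` is set. The option only controls whether the CDATA section survives in the tree, which shows up when the tree is serialized again. A minimal sketch in Python 3, reusing the document from the example above (the variable names are illustrative):

```python
from lxml import etree

XML = b"<root><data><![CDATA[test]]></data></root>"

# Default parser: CDATA sections are replaced by ordinary text content.
default_root = etree.fromstring(XML)

# strip_cdata=False: the CDATA section is kept in the parsed tree.
keeping_parser = etree.XMLParser(strip_cdata=False)
keeping_root = etree.fromstring(XML, keeping_parser)

# .text gives the bare character data in both cases.
default_text = default_root.find("data").text
keeping_text = keeping_root.find("data").text

# Serializing shows the difference between the two parsers.
default_serialized = etree.tostring(default_root)
keeping_serialized = etree.tostring(keeping_root)

print(default_text, keeping_text)
print(default_serialized)
print(keeping_serialized)
```

So if the goal is only to read the data, the default parser should be enough; `strip_cdata=False` matters mainly when the document has to be written back out with its CDATA sections intact.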