python xml xml-parsing beautifulsoup cdata

Replacing CDATA NavigableStrings with Tags in BeautifulSoup

I am parsing several XML feeds with BeautifulSoup, and would like to do some preprocessing that replaces CDATA sections with custom XML tags. To illustrate:

The following XML source...

<title>The end of the world as we know it</title>
<category><![CDATA[Planking Dancing]]></category>
<pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate>
<dc:creator><![CDATA[Bart Simpson]]></dc:creator>

...would turn into:

<title>The end of the world as we know it</title>
<category><myTag>Planking Dancing</myTag></category>
<pubDate><myTag>Sun, 16 Sep 2012 12:00:00 EDT</myTag></pubDate>
<dc:creator><myTag>Bart Simpson</myTag></dc:creator>

I don't think this question has been asked before on SO (I tried a few different SO queries). I've tried a few approaches using .findAll('cdata', text=True) and then applying the BeautifulSoup replaceWith() method to each resulting NavigableString. Those attempts resulted in either no substitution or what looks like a recursive loop.

I'm happy to post my previous attempts, but since the problem is quite simple, I'm hoping someone can post a clear example of how to accomplish the search-and-replace above using BeautifulSoup 3.

Solution

CData is a subclass of NavigableString, so you can find all CData elements by first searching for all NavigableString objects, and then testing whether each is an instance of CData. Once you've got one, it's easily replaced using replaceWith, as you suggested:

>>> from BeautifulSoup import BeautifulSoup, CData, Tag
>>> source = """
... <title>The end of the world as we know it</title>
... <category><![CDATA[Planking Dancing]]></category>
... <pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate>
... <dc:creator><![CDATA[Bart Simpson]]></dc:creator>
... """
>>> soup = BeautifulSoup(source)
>>> for navstr in soup(text=True):
...     if isinstance(navstr, CData):
...         tag = Tag(soup, "myTag")
...         tag.insert(0, navstr[:])
...         navstr.replaceWith(tag)
... 
>>> soup

<title>The end of the world as we know it</title>
<category><myTag>Planking Dancing</myTag></category>
<pubdate><myTag>Sun, 16 Sep 2012 12:00:00 EDT</myTag></pubdate>
<dc:creator><myTag>Bart Simpson</myTag></dc:creator>

>>>

A couple of notes:

You can call a BeautifulSoup object as though it were a function, and the effect is the same as calling its .findAll() method. So soup(text=True) above is equivalent to soup.findAll(text=True).
The only way I know to get the content of a CData object in BS3 is to slice it, as in the snippet above. str(navstr) would keep the <![CDATA[...]]> wrapper, which you don't want here. In BS4, str(navstr) gives you the content without the wrapper.
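
If all you need is the textual preprocessing step and not a parse tree, a regex pass over the raw feed is another option. The following is a minimal sketch that uses only the standard library, not BeautifulSoup; the names CDATA_RE and cdata_to_tag are made up for illustration, and it assumes the feed contains no CDATA-like text inside comments or processing instructions, which a regex cannot tell apart from real CDATA sections.

```python
import re
from xml.sax.saxutils import escape

# Matches one CDATA section; the non-greedy group stops at the first "]]>".
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_to_tag(source, tag_name="myTag"):
    """Replace each CDATA section in `source` with <tag_name>...</tag_name>."""
    def wrap(match):
        # CDATA content is literal text, so escape &, < and > before
        # placing it inside an ordinary element.
        return "<%s>%s</%s>" % (tag_name, escape(match.group(1)), tag_name)
    return CDATA_RE.sub(wrap, source)


source = """
<title>The end of the world as we know it</title>
<category><![CDATA[Planking Dancing]]></category>
<pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate>
<dc:creator><![CDATA[Bart Simpson]]></dc:creator>
"""

print(cdata_to_tag(source))
```

Because this works on the raw string, it leaves element names such as pubDate untouched, whereas the BS3 output above shows it lowercased to pubdate. The result can then be handed to whatever parser you prefer.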