I am parsing several XML document feeds with BeautifulSoup, and would like to do some preprocessing to replace non-standard CDATA tags with custom XML tags. To illustrate:
The following XML source...
<title>The end of the world as we know it</title>
<category><![CDATA[Planking Dancing]]></category>
<pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate>
<dc:creator><![CDATA[Bart Simpson]]></dc:creator>
...would turn into:
<title>The end of the world as we know it</title>
<category><myTag>Planking Dancing</myTag></category>
<pubDate><myTag>Sun, 16 Sep 2012 12:00:00 EDT</myTag></pubDate>
<dc:creator><myTag>Bart Simpson</myTag></dc:creator>
I don't think this question has been asked before on SO (I tried a few different SO queries). I've also tried a few different approaches using .findAll('cdata', text=True) and then applying the BeautifulSoup replaceWith() method to each resulting NavigableString. The attempts I've made have resulted in either no substitution or what looks like a recursive loop.
I'm happy to post my previous attempts, but given that the problem here is quite simple I'm hoping someone can post a clear example of how to accomplish the search-and-replace above using BeautifulSoup 3.
CData is a subclass of NavigableString, so you can find all CData elements by first searching for all NavigableString objects and then testing whether each is an instance of CData. Once you've got one, it's easily replaced using replaceWith, as you suggested:
>>> from BeautifulSoup import BeautifulSoup, CData, Tag
>>> source = """
... <title>The end of the world as we know it</title>
... <category><![CDATA[Planking Dancing]]></category>
... <pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate>
... <dc:creator><![CDATA[Bart Simpson]]></dc:creator>
... """
>>> soup = BeautifulSoup(source)
>>> for navstr in soup(text=True):
...     if isinstance(navstr, CData):
...         tag = Tag(soup, "myTag")
...         # Slicing yields the text without the <![CDATA[...]]> wrapper.
...         tag.insert(0, navstr[:])
...         navstr.replaceWith(tag)
...
>>> soup
<title>The end of the world as we know it</title>
<category><myTag>Planking Dancing</myTag></category>
<pubdate><myTag>Sun, 16 Sep 2012 12:00:00 EDT</myTag></pubdate>
<dc:creator><myTag>Bart Simpson</myTag></dc:creator>
>>>
A couple of notes:
You can call a BeautifulSoup object as though it were a function, and the effect is the same as calling its .findAll() method.
The only way I know to get the content of a CData object in BS3 is to slice it, as in the snippet above. str(navstr) would keep all the <![CDATA[...]]> junk, which obviously you don't want. In BS4, str(navstr) gives you the content without the junk.
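If the substitution only needs to happen as a preprocessing step before parsing, a regex-based alternative to the BeautifulSoup loop above may be enough. This is a minimal sketch using only the standard library, not a BeautifulSoup feature. The names cdata_to_tag and CDATA_RE are hypothetical, and it assumes CDATA sections are not nested and that the content should be XML-escaped once it becomes ordinary element text.

```python
import re
from xml.sax.saxutils import escape

# Matches one CDATA section; non-greedy so adjacent sections stay separate,
# DOTALL so the content may span several lines.
CDATA_RE = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def cdata_to_tag(source, tag_name="myTag"):
    """Replace each CDATA section with <tag_name>content</tag_name>."""
    def wrap(match):
        # CDATA content is literal text, so &, < and > must be escaped
        # when it becomes regular element content.
        return "<%s>%s</%s>" % (tag_name, escape(match.group(1)), tag_name)
    return CDATA_RE.sub(wrap, source)


source = """
<title>The end of the world as we know it</title>
<category><![CDATA[Planking Dancing]]></category>
<pubDate><![CDATA[Sun, 16 Sep 2012 12:00:00 EDT]]></pubDate>
<dc:creator><![CDATA[Bart Simpson]]></dc:creator>
"""

print(cdata_to_tag(source))
```

Because this works on the raw string, tag names such as pubDate keep their original case, unlike the BS3 output above. The trade-off is that it knows nothing about the document structure, so it would also rewrite a CDATA marker that appears inside a comment.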