Tags: wordpress, xslt, rss, xhtml-1.0-strict, rss2

Cleaning CDATA in xml through xslt


I am trying to transform RSS 2 coming from WordPress into XHTML 1.0 Strict (using a cron job and xsltproc). However, WordPress inserts an img into the CDATA section at the end of the summary element, and that img has a border attribute, which is invalid in XHTML 1.0 Strict. Because it's CDATA, I assume I can't match it with my XSLT. I can say for certain that the img is always the last thing before the CDATA section ends. I'd prefer to strip the border attribute and keep the image, but I'd rather get rid of the element entirely than have invalid markup.

Is it possible to match inside CDATA using XSLT, perhaps using a string expression? If so, is that the right way to go here, or is there a better solution to be had?


Solution

  • Remember what CDATA means: "character data". Putting something in a CDATA section means: this might look like markup, but I don't want you to treat it as markup. So if the content of the CDATA section looks like an img element, the CDATA is there to tell you not to be fooled: it's not an element at all, just text. To XSLT it is an ordinary text node, so a template that matches img will never fire on it, but string functions such as contains() and substring-before() still work. Having said that, you can of course process the text the way you process any other character string, including passing it to an XML parser to be turned into a tree of nodes.
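
As a rough illustration of the "process it as a character string" route, here is a minimal Python sketch that could run as a pre-processing step before xsltproc. It is not the answerer's method, just one way to do it. The function names `strip_img_border` and `clean_feed` are hypothetical, and the element name `description` is an assumption (the question calls it the summary element), so pass whatever name your feed actually uses. It also assumes no attribute value inside the img tag contains a literal `>`.

```python
import re
import xml.etree.ElementTree as ET

# Matches one <img ...> tag. Assumes no attribute value contains '>'.
IMG_TAG = re.compile(r'<img\b[^>]*>', re.IGNORECASE)
# Matches a border attribute with a quoted or unquoted value.
BORDER_ATTR = re.compile(
    r'\s+border\s*=\s*(?:"[^"]*"|\'[^\']*\'|[^\s>]+)', re.IGNORECASE
)


def strip_img_border(html_text):
    """Remove border attributes from every <img> tag in a plain string."""
    return IMG_TAG.sub(lambda m: BORDER_ATTR.sub('', m.group(0)), html_text)


def clean_feed(rss_xml, element_name="description"):
    """Parse the feed, clean the text of each matching element, re-serialize.

    element_name is an assumption; change it to match the real feed.
    """
    root = ET.fromstring(rss_xml)
    for el in root.iter(element_name):
        # The parser has already turned the CDATA section into ordinary text.
        if el.text:
            el.text = strip_img_border(el.text)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        '<rss><channel><item><description><![CDATA[<p>Hi</p>'
        '<img src="a.png" border="0" alt="" />]]></description>'
        '</item></channel></rss>'
    )
    print(clean_feed(sample))
```

Note that ElementTree writes the cleaned text back with `<` and `>` escaped rather than wrapped in CDATA. The two forms are equivalent to any XML parser, so the stylesheet sees the same text node either way.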