Tags: c#, xml, cdata, url-encoding

Remove all CDATA nodes and replace with encoded text


So, I've got a massive XML file and I want to remove all CDATA sections and replace each CDATA node's contents with safe, HTML-encoded text nodes.

Just stripping out the CDATA with a regex will of course break the parsing. Is there a LINQ or XmlDocument or XmlTextWriter technique to swap out the CDATA with encoded text?

I'm not too concerned with the final encoding quite yet, just how to replace the sections with the encoding of my choice.

Original Example

    <COLLECTION type="presentation" autoplay="false">
      <TITLE><![CDATA[Rights & Responsibilities]]></TITLE>
      <ITEM id="2802725d-dbac-e011-bcd6-005056af18ff" presenterGender="male">
        <TITLE><![CDATA[Watch the demo]]></TITLE>
        <LINK><![CDATA[_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4]]></LINK>
      </ITEM>
    </COLLECTION>

Should Become

          <COLLECTION type="presentation" autoplay="false">
            <TITLE>Rights &amp; Responsibilities</TITLE>
            <ITEM id="2802725d-dbac-e011-bcd6-005056af18ff" presenterGender="male">
              <TITLE>Watch the demo</TITLE>
              <LINK>_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4</LINK>
            </ITEM>
          </COLLECTION>

I guess the ultimate goal is to move to JSON. I've tried this:

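            // Needs System.Xml and Json.NET (Newtonsoft.Json) for JsonConvert.
            XmlDocument doc = new XmlDocument();
            doc.Load(Server.MapPath(@"~/somefile.xml"));
            // Serializes the whole DOM, CDATA nodes included, to a JSON string.
            string jsonText = JsonConvert.SerializeXmlNode(doc);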

But I end up with ugly nodes, i.e. "#cdata-section" keys. It would take WAAAAY too many hours to have the front end re-developed to accept this.

"COLLECTION":[{"@type":"whitepaper","TITLE":{"#cdata-section":"SUPPORTING DOCUMENTS"}},{"@type":"presentation","@autoplay":"false","TITLE":{"#cdata-section":"Demo Presentation"},"ITEM":{"@id":"2802725d-dbac-e011-bcd6-005056af18ff","@presenterGender":"male","TITLE":{"#cdata-section":"Watch the demo"},"LINK":{"#cdata-section":"_assets/2302725d-dbac-e011-bcd6-005056af18ff/presentation/presentation-00000000.mp4"}

Solution

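  • Process the XML with an XSLT identity transform that just copies input to output. The stylesheet sees CDATA sections as ordinary text nodes, and because the output declares no cdata-section-elements, the serializer writes them back as escaped text. C# code: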

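      // Needs: using System.Xml.Xsl;
      XslCompiledTransform transform = new XslCompiledTransform();
      // Load the identity stylesheet shown below.
      transform.Load(@"c:\temp\id.xslt");
      // Read cdata.xml, apply the stylesheet, write the result to clean.xml.
      transform.Transform(@"c:\temp\cdata.xml", @"c:\temp\clean.xml");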
    

    id.xslt:

    <?xml version="1.0" encoding="utf-8"?>
    <xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
    
        <xsl:output method="xml" indent="yes"/>
    
        <xsl:template match="@* | node()">
            <xsl:copy>
                <xsl:apply-templates select="@* | node()"/>
            </xsl:copy>
        </xsl:template>
    </xsl:stylesheet>
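
If you'd rather stay in C# without a stylesheet, LINQ to XML can probably do the same swap directly. This is only a minimal sketch, assuming the document fits in memory as an XDocument. The names `CDataCleaner` and `ReplaceCData` are made up for illustration, and the inline sample XML stands in for the real file.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of any library.
public static class CDataCleaner
{
    // Replaces every CDATA node in the document with a plain text node
    // holding the same characters. Escaping (& -> &amp; etc.) is done by
    // the XML writer when the document is serialized.
    public static void ReplaceCData(XDocument doc)
    {
        // ToList() takes a snapshot so the tree can be modified while looping.
        foreach (XCData cdata in doc.DescendantNodes().OfType<XCData>().ToList())
        {
            cdata.ReplaceWith(new XText(cdata.Value));
        }
    }

    public static void Main()
    {
        // Sample input; for a real file use XDocument.Load(path) instead.
        XDocument doc = XDocument.Parse(
            "<COLLECTION><TITLE><![CDATA[Rights & Responsibilities]]></TITLE></COLLECTION>");

        ReplaceCData(doc);

        // prints <COLLECTION><TITLE>Rights &amp; Responsibilities</TITLE></COLLECTION>
        Console.WriteLine(doc.ToString(SaveOptions.DisableFormatting));
    }
}
```

Since the end goal is JSON, the cleaned XDocument could then be handed to Json.NET's `JsonConvert.SerializeXNode(doc)`, which should see ordinary text nodes rather than "#cdata-section" keys. I have not checked the resulting JSON against the front end's expected shape.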