Remove CDATA from XML with regex in Windows CMD (powershell)


I am working with some XML data and I am stuck trying to remove CDATA sections from it. I have tried many approaches, and the simplest seems to be replacing every occurrence of a pattern such as

hey <![CDATA[mate - number 1]]> what's up

by

hey mate - number 1 what's up

The regex that matches the whole expression is (\<\!\[CDATA\[)(.*)(\]\]\>), so with Perl (PCRE) I just need to replace it with \2.

Based on this, and taking advantage of PowerShell, I am running the following in CMD:

powershell -Command "(gc Desktop\test_in.xml) -replace '(\<\!\[CDATA\[)(.*)(\]\]\>)', '\2' | Out-File Desktop\test_out.xml"

However, the result is that every match is replaced by the literal string \2, instead of mate - number 1 as in the example.

Instead of \2, I also tried (?<=(\<\!\[CDATA\[))(.*?)(?=(\]\]\>)) as the replacement, since that pattern matches the inner part I am trying to keep, but the result is frustrating: again, it is inserted literally.

Any guess?

Thank you!

PS. If anyone knows how to do this replacement in R, that would be useful as well.


Solution

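  • Any XSLT stylesheet that runs the identity transform (i.e., copies the input document node by node) will drop the <![CDATA[...]]> markers and keep their content as ordinary text, because XSLT's data model treats CDATA sections as plain text nodes. Consider running it with R's xslt package or with PowerShell: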

    library(xml2)
    library(xslt)
    
    txt <- "<root>
                  <data>hey <![CDATA[mate - number 1]]> what's up</data>
           </root>"    
    doc <- read_xml(txt)
    
    txt <- '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
                <xsl:output indent="yes"/>
                <xsl:strip-space elements="*"/>
    
                <xsl:template match="@*|node()">
                  <xsl:copy>
                     <xsl:apply-templates select="@*|node()"/>
                  </xsl:copy>
                </xsl:template>
    
             </xsl:stylesheet>'    
    style <- read_xml(txt)
    
    new_xml <- xml_xslt(doc, style)
    
    # Output
    cat(as.character(new_xml))
    
    # <?xml version="1.0" encoding="UTF-8"?>
    # <root>
    #    <data>hey mate - number 1 what's up</data>
    # </root>
    

    PowerShell

    $xslt = New-Object System.Xml.Xsl.XslCompiledTransform;
    
    $xslt.Load("C:\Path\To\Identity_Transform\Script.xsl");
    $xslt.Transform("C:\Path\To\Input.xml", "C:\Path\To\Output.xml");
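
If the regex route from the question is still preferred, note that PowerShell's -replace operator uses .NET regular expressions, where a captured group is referenced in the replacement as $2 rather than \2, which is why \2 was inserted literally. For the R part of the question, the following is a minimal sketch in base R, not a definitive implementation. The variable names are made up for illustration, and it assumes the CDATA content contains no characters such as < or & that would make the result malformed XML once the markers are removed.

```r
# Sample line taken from the question
xml_line <- "hey <![CDATA[mate - number 1]]> what's up"

# Lazy (.*?) stops at the first "]]>", so several CDATA sections on the
# same line are unwrapped one by one; (?s) lets "." match newlines too.
# These are PCRE features, hence perl = TRUE below.
pattern <- "(?s)<!\\[CDATA\\[(.*?)\\]\\]>"

# "\\1" in the replacement refers to the first (and only) capture group
stripped <- gsub(pattern, "\\1", xml_line, perl = TRUE)
print(stripped)
# [1] "hey mate - number 1 what's up"

# A line with two CDATA sections, to show the effect of the lazy quantifier
two_sections <- "<a><![CDATA[x]]></a><b><![CDATA[y]]></b>"
stripped_two <- gsub(pattern, "\\1", two_sections, perl = TRUE)
```

To process a whole file, the same gsub call can be applied to the result of readLines() (pasted together with newlines if a CDATA section may span several lines) and written back with writeLines(). The XSLT approach above remains the safer choice, since it escapes special characters in the unwrapped text.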