I am working with some XML data and I am stuck trying to remove CDATA sections from it. I have tried many approaches, and the simplest seems to be replacing every occurrence of a pattern like
hey <![CDATA[mate - number 1]]> what's up
with
hey mate - number 1 what's up
The regex that matches the whole expression is (\<\!\[CDATA\[)(.*)(\]\]\>), so with Perl-style (PCRE) syntax I just need to replace it with \2.
Based on this, and taking advantage of PowerShell, I am running the following in CMD:
powershell -Command "(gc Desktop\test_in.xml) -replace '(\<\!\[CDATA\[)(.*)(\]\]\>)', '\2' | Out-File Desktop\test_out.xml"
However, every match is replaced by the literal string \2 instead of mate - number 1, as in the example.
Instead of \2, I also tried (?<=(\<\!\[CDATA\[))(.*?)(?=(\]\]\>)), since it matches the inner part I am trying to keep, but the result is just as frustrating: again a literal replacement.
Any guess?
Thank you!
PS. If anyone knows how to handle this replacement in R, that would be useful as well.
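Any XSLT that runs the identity transform (i.e., copies the input document to the output unchanged) will remove the <![CDATA[ ... ]]> markers, because the serializer writes their content as ordinary escaped text. Consider running it with R's xslt package or with PowerShell: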
library(xml2)
library(xslt)
txt <- "<root>
<data>hey <![CDATA[mate - number 1]]> what's up</data>
</root>"
doc <- read_xml(txt)
txt <- '<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
<xsl:output indent="yes"/>
<xsl:strip-space elements="*"/>
<xsl:template match="@*|node()">
<xsl:copy>
<xsl:apply-templates select="@*|node()"/>
</xsl:copy>
</xsl:template>
</xsl:stylesheet>'
style <- read_xml(txt)   # parse the stylesheet string; no package argument is needed here
new_xml <- xml_xslt(doc, style)
# Output
cat(as.character(new_xml))
# <?xml version="1.0" encoding="UTF-8"?>
# <root>
# <data>hey mate - number 1 what's up</data>
# </root>
PowerShell
$xslt = New-Object System.Xml.Xsl.XslCompiledTransform;
$xslt.Load("C:\Path\To\Identity_Transform\Script.xsl");
$xslt.Transform("C:\Path\To\Input.xml", "C:\Path\To\Output.xml");
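As for why the original command inserted a literal \2: PowerShell's -replace operator uses the .NET regex engine, where the replacement string refers to capture groups as $1, $2, and so on, not \1 or \2. Below is a minimal sketch of the regex route under that assumption. The variable names are only illustrative, the sample string is the one from the question, and the lazy .*? (instead of the greedy .*) is a change on my part so that two CDATA sections on the same line are not merged into one match.

```powershell
# Sample input taken from the question.
$text = "hey <![CDATA[mate - number 1]]> what's up"

# (?s) lets . match newlines, so a CDATA section spanning several lines
# is still matched when the input is a single string.
# .*? is lazy: it stops at the first ]]> rather than the last one.
$pattern = '(?s)<!\[CDATA\[(.*?)\]\]>'

# .NET replacement syntax: $1 is the first capture group.
# Single quotes keep PowerShell from expanding $1 as a variable.
$result = $text -replace $pattern, '$1'

$result
# hey mate - number 1 what's up

# Hypothetical file version, reusing the paths from the question.
# -Raw (PowerShell 3.0+) reads the file as one string instead of line by line:
# (Get-Content -Raw Desktop\test_in.xml) -replace $pattern, '$1' | Out-File Desktop\test_out.xml
```

One caveat with the regex route: if the CDATA content contains < or &, stripping the markers leaves those characters unescaped and the output is no longer well-formed XML. The XSLT approach escapes them for you, which is why I would still prefer it.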