I'm trying to parse a large XML file and extract the raw content inside each tag, from something like this:
<string name="key"><![CDATA[Hey I'm a tag with & and other characters]]></string>
to get this:
<![CDATA[Hey I'm a tag with & and other characters]]>
However, when I use Nokogiri's SAX parser, I only get the text, without the CDATA wrapper and with characters escaped, like this:
Hey I\'m a tag with & and other characters
This is my code:
class IDCollector < Nokogiri::XML::SAX::Document
  def characters(string)
    puts string # this does not work, the CDATA wrapper is not printed
  end

  def cdata_block(string)
    puts string
    puts "<![CDATA[" + string + "]]>"
  end
end
Is there any way to do this with Nokogiri SAX?
It's not clear what you're trying to do, but this might help clear things up.
A <![CDATA[...]]> entry isn't a tag, it's a block, and is treated differently by the parser. When the block is encountered, the <![CDATA[ and ]]> are stripped off, so you'll only see the string inside. See "What does <![CDATA[]]> in XML mean?" for more information.
If you're trying to create a CDATA block in XML it can be done easily using:
doc = Nokogiri::XML(%(<string name="key"></string>))
doc.at('string') << Nokogiri::XML::CDATA.new(doc, "Hey I'm a tag with & and other characters")
doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\"><![CDATA[Hey I'm a tag with & and other characters]]></string>\n"
<< is just shorthand for adding a child node.
Trying to use inner_html doesn't do what you want, as it creates a text node as a child:
doc = Nokogiri::XML(%(<string name="key"></string>))
doc.at('string').inner_html = "Hey I'm a tag with & and other characters"
doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\">Hey I'm a tag with &amp; and other characters</string>\n"
doc.at('string').children.first.text # => "Hey I'm a tag with & and other characters"
doc.at('string').children.first.class # => Nokogiri::XML::Text
Using inner_html causes the string to be entity-encoded, which is the other way of embedding text that could include tags. Without either the encoding or a CDATA block, XML parsers could get confused about what is text and what is a real tag. I've written RSS aggregators, and having to deal with incorrectly encoded embedded HTML in a feed is a pain.