How to use SAX to get CDATA content


I'm trying to parse a large XML file and get the full content of each tag, including any CDATA markers. Given something like this:

<string name="key"><![CDATA[Hey I'm a tag with & and other characters]]></string>

I want to get this:

<![CDATA[Hey I'm a tag with & and other characters]]>

However, when I use Nokogiri's SAX XML parser I only get the text, without the CDATA markers and with characters escaped, like this:

Hey I\'m a tag with &amp; and other characters

This is my code:

  require 'nokogiri'

  class IDCollector < Nokogiri::XML::SAX::Document
    # Called for ordinary text content.
    def characters(string)
      puts string # this does not work, the CDATA markers are not printed
    end

    # Called for CDATA sections.
    def cdata_block(string)
      puts string
      puts "<![CDATA[" + string + "]]>"
    end
  end

Is there any way to do this with Nokogiri SAX?


Solution

  • It's not clear what you're trying to do, but this might help clear things up.

    A <![CDATA[...]]> entry isn't a tag; it's a CDATA section, and the parser treats it differently. When the parser encounters one, it strips off the <![CDATA[ and ]]> delimiters, so you'll only see the string inside. See "What does <![CDATA[]]> in XML mean?" for more information.

    If you're trying to create a CDATA block in XML it can be done easily using:

    doc = Nokogiri::XML(%(<string name="key"></string>))
    doc.at('string') << Nokogiri::XML::CDATA.new(doc, "Hey I'm a tag with & and other characters")
    doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\"><![CDATA[Hey I'm a tag with & and other characters]]></string>\n"
    

    << is just shorthand for add_child, which appends the node as a child.

    Trying to use inner_html doesn't do what you want as it creates a text node as a child:

    doc = Nokogiri::XML(%(<string name="key"></string>))
    doc.at('string').inner_html = "Hey I'm a tag with & and other characters"
    doc.to_xml # => "<?xml version=\"1.0\"?>\n<string name=\"key\">Hey I'm a tag with &amp; and other characters</string>\n"
    doc.at('string').children.first.text # => "Hey I'm a tag with & and other characters"
    doc.at('string').children.first.class # => Nokogiri::XML::Text
    

    Assigning to inner_html here stores the string as a text node, so the & is entity-encoded as &amp; when the document is serialized. Entity encoding is the alternative to CDATA for embedding text that could include tags. Without one or the other, an XML parser can get confused about what is text and what is a real tag. I've written RSS aggregators, and having to deal with incorrectly encoded embedded HTML in a feed is a pain.