Why can't REXML parse CDATA preceded by a line break?


I'm very new to Ruby and am trying to parse an XML document with REXML. The document was previously pretty-printed (by REXML), with some slightly erratic results.

Some CDATA sections have a line break after the opening XML tag but before the opening of the CDATA block. In these cases REXML parses the text of the tag as empty.

  • Any idea if I can get REXML to read these lines?
  • If not, could I rewrite them beforehand with a regex or something?
  • Is this even valid XML?

Here's an example XML document (much abridged):

<?xml version="1.0" encoding="utf-8"?>
<root-tag>
    <content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
    <content type="base64">
        <![CDATA[VGhpcyB3b250IHdvcms=]]></content>

    <content><![CDATA[This will work]]></content>
    <content>
        <![CDATA[This will not appear]]></content>

    <content>
        Seems happy</content>
    <content>Obviously no problem</content>
</root-tag>

and here's my Ruby script (distilled down to a minimal example):

require 'rexml/document'
require 'base64'
include REXML

module RexmlSpike
  file = File.new("ex.xml")
  doc = Document.new file
  doc.elements.each("root-tag/content") do |contentElement|
    if contentElement.attributes["type"] == "base64"
      puts "decoded: " << Base64.decode64(contentElement.text)
    else
      puts "raw: " << contentElement.text
    end
  end
  puts "Finished."
end

The output I get is:

>> ruby spike.rb
  decoded: Well done! It works :)
  decoded:
  raw: This will work
  raw:

  raw:
          Seems happy
  raw: Obviously no problem
  Finished.

I'm using Ruby 1.9.3p392 on OS X Lion. The ultimate goal is to parse comments from some BlogML into the custom import XML used by Disqus.


Solution

  • Why

    Element#text returns the value of the element's first text node only, and whitespace counts as text. So anything that comes before the <![CDATA[]]>, whether a letter, a single space, or a newline (as you've discovered), becomes that first text node and hides whatever is in the <![CDATA[]]>. In the examples where you are able to read the <![CDATA[]]> content, it is because nothing precedes it, so the CDATA section itself is the first text node of the element.
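One way to check this is to list the child nodes REXML builds for one of the failing elements. The snippet below is only a sketch: it uses an inline copy of one `<content>` element from the question instead of reading `ex.xml`, and the variable names are made up for illustration.

```ruby
require 'rexml/document'

# Inline copy of a failing element: a line break and indentation
# sit between the opening tag and the CDATA section.
broken_xml = "<content>\n    <![CDATA[This will not appear]]></content>"
broken = REXML::Document.new(broken_xml).root

# The whitespace becomes its own text node, ahead of the CDATA node.
child_classes = broken.children.map(&:class)
p child_classes          # => [REXML::Text, REXML::CData]

# Element#text only looks at the first text node, i.e. the whitespace.
p broken.text            # => "\n    "

# With nothing before the CDATA section, it is the first text node.
working_xml = "<content><![CDATA[This will work]]></content>"
working = REXML::Document.new(working_xml).root
p working.text           # => "This will work"
```

So the CDATA content is parsed correctly in both cases; in the failing one it is simply not the node that `text` returns.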


  • Solution

    If you look at the documentation for Element, you'll see that it has a method called cdatas(), described as:

    Get an array of all CData children. IMMUTABLE.

    So, in your example, if you add an inner loop over contentElement.cdatas(), you will see the content of all the CDATA sections that were missing from your output.
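A minimal sketch of that inner loop follows, assuming each `<content>` element holds its payload in one or more CDATA children. It parses an inline copy of the two failing elements from the question instead of `ex.xml`, and `cdata_text` is a hypothetical helper name, not part of REXML.

```ruby
require 'rexml/document'
require 'base64'

# Inline copy of the two elements that came back empty in the question.
xml = <<XML
<root-tag>
    <content type="base64">
        <![CDATA[VGhpcyB3b250IHdvcms=]]></content>
    <content>
        <![CDATA[This will not appear]]></content>
</root-tag>
XML

# Hypothetical helper: join the values of all CDATA children,
# ignoring any whitespace text nodes around them.
def cdata_text(element)
  element.cdatas.map(&:value).join
end

doc = REXML::Document.new(xml)
results = []
doc.elements.each('root-tag/content') do |content_element|
  raw = cdata_text(content_element)
  if content_element.attributes['type'] == 'base64'
    results << Base64.decode64(raw)
  else
    results << raw
  end
end

results.each { |line| puts line }
```

Note that an element containing plain text and no CDATA section, such as the "Seems happy" one in the question, would yield an empty string from this helper, so a real script would probably fall back to `text` when `cdatas` is empty.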