I'm very new to Ruby, and trying to parse an XML document with REXML that has been previously pretty-printed (by REXML) with some slightly erratic results.
Some CDATA sections have a line break after the opening XML tag, but before the opening of the CDATA block, in these cases REXML parses the text of the tag as empty.
Here's an example XML document (much abridged):
<?xml version="1.0" encoding="utf-8"?>
<root-tag>
<content type="base64"><![CDATA[V2VsbCBkb25lISBJdCB3b3JrcyA6KQ==]]></content>
<content type="base64">
<![CDATA[VGhpcyB3b250IHdvcms=]]></content>
<content><![CDATA[This will work]]></content>
<content>
<![CDATA[This will not appear]]></content>
<content>
Seems happy</content>
<content>Obviously no problem</content>
</root-tag>
and here's my Ruby script (distilled down to a minimal example):
require 'rexml/document'
require 'base64'
include REXML
module RexmlSpike
file = File.new("ex.xml")
doc = Document.new file
doc.elements.each("root-tag/content") do |contentElement|
if contentElement.attributes["type"] == "base64"
puts "decoded: " << Base64.decode64(contentElement.text)
else
puts "raw: " << contentElement.text
end
end
puts "Finished."
end
The output I get is:
>> ruby spike.rb
decoded: Well done! It works :)
decoded:
raw: This will work
raw:
raw:
Seems happy
raw: Obviously no problem
Finished.
I'm using Ruby 1.9.3p392 on OSX Lion. The object of the exercise is ultimately to parse comments from some BlogML into the custom import XML used by Disqus.
Having anything before the <![CDATA[]]>
overrides whatever is in the <![CDATA[]]>
. Anything from a letter, to a newline (like you've discovered), or a single space. This makes sense, because your example is getting the text
of the element, and whitespace counts as text. In the examples where you are able to access <![CDATA[]]>
, it is because text is nil.
If you look at the documentation for Element, you'll see that it has a function called cdatas()
that:
Get an array of all CData children. IMMUTABLE.
So, in your example, if you do an inner loop on contentElement.cdatas()
you would see the content of all your missing tags.