Invalid characters before my XML in Ruby


When I look in an XML file, it looks fine, and starts with <?xml version="1.0" encoding="utf-16le" standalone="yes"?>

But when I read it in Ruby and print it to stdout, there are two ?s in front of that: ??<?xml version="1.0" encoding="utf-16le" standalone="yes"?>

Where do these come from, and how do I remove them? Parsing it like this with REXML fails immediately. Removing the first two characters and then parsing it gives me this error:

REXML::ParseException: #<REXML::ParseException: malformed XML: missing tag start Line: Position: Last 80 unconsumed characters: <?xml version="1.0" encoding="utf-16le" s>

What is the right way to handle this?

Edit: Below is my code. The ftp.get call downloads the XML from an FTP server. (I wonder if that might be relevant.)

xml = ftp.get
puts xml
until xml[0,1] == "<"  # to remove the 2 invalid characters
  puts xml[0,2]
  xml.slice! 0
end
puts xml
document = REXML::Document.new(xml)

The last puts prints the correct XML. But because of the two invalid characters, I've got the feeling something else went wrong. It shouldn't be necessary to remove anything. I'm at a loss as to what the problem might be, though.

Edit 2: I'm using Net::FTP to download the XML, but with this new method that lets me read the contents into a string instead of a file:

class Net::FTP

  def gettextcontent(remotefile, &block) # :yield: line
    f = StringIO.new()
    begin
      retrlines("RETR " + remotefile) do |line|
        f.puts(line)
        yield(line) if block
      end
    ensure
      f.close
      return f
    end
  end
end

Edit 3: It seems to be caused by StringIO (in Ruby 1.8.7) not supporting Unicode. I'm not sure if there's a workaround for that.


Solution

  • To answer my own question, the real problem here is that encoding support in Ruby 1.8.7 is lacking. StringIO in particular seems to make a mess of it. REXML also has trouble handling Unicode in Ruby 1.8.7.

    The most attractive solution would be of course to upgrade to 1.9.3, but that's not practical for this project right now.

    What I ended up doing was to avoid StringIO and simply download to a file on disk, and then to process the XML with Nokogiri instead of REXML.

    Together, that solves all my problems.
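
For reference, the two stray characters are most likely the UTF-16LE byte order mark (the bytes 0xFF 0xFE), which a terminal that cannot display them prints as ??. On Ruby 1.9 or later, where strings carry an encoding, a minimal sketch of dealing with it could look like the following. The helper name utf16le_xml_to_utf8 and the sample data are made up for illustration and are not from the original code, and the sketch assumes the input really is UTF-16LE.

```ruby
# Hypothetical helper (not from the original post): takes the raw bytes of a
# UTF-16LE document, drops a leading byte order mark if there is one, and
# returns the text transcoded to UTF-8.
UTF16LE_BOM = "\xFF\xFE".b

def utf16le_xml_to_utf8(raw)
  # Work on a binary copy so the caller's string is left untouched.
  bytes = raw.dup.force_encoding(Encoding::BINARY)
  # Skip the two BOM bytes when they are present.
  bytes = bytes.byteslice(2..-1) if bytes.start_with?(UTF16LE_BOM)
  # Label the remaining bytes as UTF-16LE, then convert them to UTF-8.
  bytes.force_encoding(Encoding::UTF_16LE).encode(Encoding::UTF_8)
end

# Made-up sample: a BOM followed by a tiny UTF-16LE document.
sample = UTF16LE_BOM + "<?xml version=\"1.0\"?><a/>".encode(Encoding::UTF_16LE).b
puts utf16le_xml_to_utf8(sample)
```

Note that the XML declaration in the result would still say encoding="utf-16le", so a parser that trusts the declaration over the string's own encoding may need that attribute adjusted. None of this helps on Ruby 1.8.7, where String has no force_encoding or encode.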