ruby, xml, xml-parsing, rubygems, nokogiri

Parsing a non-XML document with Nokogiri when the node names are/contain integers


When I run:

#!/usr/bin/env ruby
require 'nokogiri'

xml = <<-EOXML
<pajamas>
  <bananas>
    <foo>bar</foo>
    <bar>bar</bar>
    <1>bar</1>
  </bananas>
</pajamas>
EOXML

doc = Nokogiri::XML(xml)
puts doc.at('/pajamas/bananas/foo')
puts doc.at('/pajamas/bananas/bar')
puts doc.at('/pajamas/bananas/1')

I get an ERROR: Invalid expression: /pajamas/bananas/1 (Nokogiri::XML::XPath::SyntaxError)

Is this a case of Nokogiri not liking integers as node names, and is there a workaround?

I did not see a workaround for this in the documentation. Removing the last line eliminates the error, and the first two nodes print as expected.


Solution

  • An XML element with a name that starts with a number is invalid XML.

    XML elements must follow these naming rules:

    • Names can contain letters, numbers, and other characters
    • Names must start with a letter or an underscore; they cannot start with a number or a punctuation character such as - or .
    • Names should not start with the letters xml (or XML, or Xml, etc.), because that prefix is reserved
    • Names cannot contain spaces

    Apart from that, any name can be used; there are no reserved words.
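As a rough illustration of the rules listed above, the following sketch checks candidate element names against a simplified pattern. The constant `XML_NAME` and the helper `plausible_xml_name?` are made up for this example, and the pattern is an ASCII-only approximation: the real XML grammar also permits many non-ASCII letters and the colon used for namespaces.

```ruby
# Simplified, ASCII-only approximation of the XML name rules.
# Assumption: non-ASCII letters and the namespace colon are deliberately
# left out, so this is stricter than the actual XML grammar.
XML_NAME = /\A[A-Za-z_][A-Za-z0-9._-]*\z/

# Hypothetical helper: true if +name+ looks like a usable element name.
def plausible_xml_name?(name)
  XML_NAME.match?(name)
end

# Print the verdict for a few sample names, including the "1" from the question.
%w[foo bar 1 item1 _1 -x].each do |name|
  puts "#{name.inspect}: #{plausible_xml_name?(name)}"
end
```

A check like this is only useful for spotting obviously bad names early; it is not a substitute for a real parser.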

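    You're trying to parse invalid XML with an XML parser, and it's just not going to work. That is also why the error comes from the XPath engine: 1 is not a valid element name in an XPath expression either. If you really are getting <1> as a tag and can't control that somehow, I'd suggest replacing the tags using a regex before the document gets to Nokogiri.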
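The regex replacement suggested above could look something like the following minimal sketch. The helper name `prefix_numeric_tags` and the prefix `n` are arbitrary choices for illustration, not anything provided by Nokogiri. The sketch assumes that tags starting with a digit never occur inside CDATA sections, comments, or attribute values, since a regex cannot tell those apart from real markup.

```ruby
# Hypothetical helper: prefix every tag whose name starts with a digit so
# the document becomes well-formed. The default prefix "n" is arbitrary.
# Assumption: "<1"-style sequences do not appear inside CDATA sections,
# comments, or attribute values.
def prefix_numeric_tags(xml, prefix: 'n')
  # Group 1 is "<" or "</"; group 2 is the digit that starts the tag name.
  xml.gsub(%r{(</?)(\d)}) { "#{$1}#{prefix}#{$2}" }
end

raw = <<~EOXML
  <pajamas>
    <bananas>
      <foo>bar</foo>
      <1>bar</1>
    </bananas>
  </pajamas>
EOXML

# Print the sanitized document; <1>…</1> is rewritten to <n1>…</n1>.
puts prefix_numeric_tags(raw)
```

Passing the sanitized string to Nokogiri::XML should then let a query such as doc.at('/pajamas/bananas/n1') find the renamed node. The renaming is one-way, so any code that writes the document back out would need to strip the prefix again.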