Can I write an XML reader that can cope with unclosed tags?


I'm parsing the Wikipedia XML dump using a REXML StreamListener. After a few million articles, it complains that it can't find a matching close tag and skips the rest of the file.

Is there any way to get it to ignore the unclosed tag, and to resume parsing the stream after it?


Solution

  • The Nokogiri SAX mode is very similar to REXML's SAX (StreamListener) mode. Sample:

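    require 'nokogiri'
    
    # SAX handler: Nokogiri calls these methods as it streams through the file.
    class PostCallbacks < Nokogiri::XML::SAX::Document
      # attributes is an array of [name, value] pairs
      def start_element(element, attributes = [])
        if element == 'tag'
          # Process tag data here
        end
      end
    end
    
    parser = Nokogiri::XML::SAX::Parser.new(PostCallbacks.new)
    parser.parse_file("data.xml")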
    

    Nokogiri also has a Reader interface which yields every node, in case you don't like the SAX-style callback interface.

    # xml is a String or IO holding the document
    reader = Nokogiri::XML::Reader(xml)
    reader.each do |node|
      # node is an instance of Nokogiri::XML::Reader
      puts node.name
    end
    

    The key difference is that Nokogiri recovers from input that is not well formed (an unclosed tag, for example) better than most parsers, thanks to the recover mode of the underlying libxml2 library (on by default, I believe).