I'm parsing the Wikipedia XML dump using a REXML StreamListener. After a few million articles, it complains that it can't find a matching close tag and skips the rest of the file.
Is there any way to get it to ignore the unclosed tag, and to resume parsing the stream after it?
The Nokogiri SAX mode is very similar to REXML's SAX (StreamListener) mode. Sample:
    require 'nokogiri'
    include Nokogiri

    class PostCallbacks < XML::SAX::Document
      # Called once for every opening tag in the stream
      def start_element(element, attributes)
        if element == 'tag'
          # Process tag data here
        end
      end
    end

    parser = XML::SAX::Parser.new(PostCallbacks.new)
    parser.parse_file("data.xml")
Nokogiri also has a Reader interface which yields every node, in case you don't like the SAX-style callback interface.
    # xml is a String or an IO object containing the XML
    reader = Nokogiri::XML::Reader(xml)
    reader.each do |node|
      # node is an instance of Nokogiri::XML::Reader
      puts node.name
    end
The difference is that Nokogiri can recover from malformed XML better than pretty much any parser out there, thanks to the underlying libxml2 recover mode (on by default, I believe).