ruby xml nokogiri builder

Creating a large XML file in Ruby


I want to write approximately 50MB of data to an XML file.

I found Nokogiri (1.5.0) to be efficient for parsing, that is, for reading rather than writing. It does not seem like a good option for writing a large XML file, because its builder holds the complete document in memory until it is finally written out.

I found Builder (3.0.0) to be a good option but I'm not sure if it's the best option.

I tried some benchmarks with the following simple code:

  (1..500000).each do |k|
    xml.products {
      xml.widget {
        xml.id_ k
        xml.name "Awesome widget"
      }
    }
  end

Nokogiri took about 143 seconds, and its memory consumption increased gradually, ending at about 700 MB.

Builder took about 123 seconds, and its memory consumption stayed stable at about 10 MB.

So is there a better solution to write huge XML files (50 MB) in Ruby?

Here's the code using Nokogiri:

require 'rubygems'
require 'nokogiri'
a = Time.now
builder = Nokogiri::XML::Builder.new do |xml|
  xml.root {
    (1..500000).each do |k|
      xml.products {
        xml.widget {
          xml.id_ k
          xml.name "Awesome widget"
        }
      }
    end
  }
end
o = File.new("test_noko.xml", "w")
o.write(builder.to_xml)
o.close
puts (Time.now-a).to_s

Here's the code using Builder:

require 'rubygems'
require 'builder'
a = Time.now
File.open("test.xml", 'w') {|f|
  xml = Builder::XmlMarkup.new(:target => f, :indent => 1)

  (1..500000).each do |k|
    xml.products {
      xml.widget {
        xml.id_ k
        xml.name "Awesome widget"
      }
    }
  end
}
puts (Time.now-a).to_s
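
Both scripts time themselves by subtracting two `Time.now` values. A monotonic clock is less sensitive to system clock adjustments during a long run. The following is only a minimal sketch: `elapsed_seconds` is a hypothetical helper name, and the loop inside it is a stand-in workload rather than either of the benchmarks above.

```ruby
# Hypothetical helper: run the block and return the elapsed wall-clock
# seconds, measured with the monotonic clock.
def elapsed_seconds
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

seconds = elapsed_seconds do
  # Stand-in workload; replace with the Nokogiri or Builder loop.
  100_000.times { |k| "<widget><id>#{k}</id></widget>" }
end
puts format('%.3f seconds', seconds)
```

This measures only the wrapped block, so interpreter start-up and `require` time are excluded.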

Solution

  • Solution 1

    If speed is your main concern, I'd just use libxml-ruby directly:

    $ time ruby test.rb 
    
    real    0m7.352s
    user    0m5.867s
    sys     0m0.921s
    

    The API is pretty straightforward:

    require 'rubygems'
    require 'xml'
    doc = XML::Document.new()
    doc.root = XML::Node.new('root_node')
    root = doc.root
    
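    500000.times do |k|
      # Append <products><widget id="..." name="..."/></products> under the root.
      # Note that the values are stored as attributes here, not as child elements.
      root << (elem1 = XML::Node.new('products'))
      elem1 << (elem2 = XML::Node.new('widget'))
      elem2['id'] = k.to_s
      elem2['name'] = 'Awesome widget'
    end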
    
    doc.save('foo.xml', :indent => false, :encoding => XML::Encoding::UTF_8)
    

    Using :indent => true doesn't make much difference in this case, but for more complex XML files it might.

    $ time ruby test.rb #(with indent)
    
    real    0m7.395s
    user    0m6.050s
    sys     0m0.847s
    

  • Solution 2

    Of course, the fastest solution, and one that doesn't build up memory, is to write the XML by hand. The trade-off is that this easily introduces other sources of error, such as invalid XML:

    $ time ruby test.rb 
    
    real    0m1.131s
    user    0m0.873s
    sys     0m0.126s
    

    Here's the code:

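    # The block form closes the file even if an exception is raised.
    File.open("foo.xml", "w") do |f|
      f.puts('<doc>')
      500000.times do |k|
        # Nothing is escaped here; that is safe only because both values are known.
        f.puts "<product><widget id=\"#{k}\" name=\"Awesome widget\" /></product>"
      end
      f.puts('</doc>')
    end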
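
The main risk with hand-written XML is unescaped data. As a minimal sketch of one way to reduce that risk, the version below escapes attribute values before interpolating them and writes to any IO-like object. `xml_escape` and `write_products` are hypothetical helper names, not part of any library, and the sketch assumes the input is valid UTF-8 with no control characters that XML forbids.

```ruby
require 'stringio'

# The five characters that are special in XML text and attribute values.
XML_ESCAPES = {
  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;',
  '"' => '&quot;', "'" => '&apos;'
}.freeze

# Hypothetical helper: replace each special character with its entity.
def xml_escape(value)
  value.to_s.gsub(/[&<>"']/, XML_ESCAPES)
end

# Hypothetical helper: stream one <product> line per widget to an IO-like
# object, so memory use does not grow with the number of widgets.
def write_products(io, count, name)
  io.puts '<doc>'
  count.times do |k|
    io.puts %(<product><widget id="#{k}" name="#{xml_escape(name)}" /></product>)
  end
  io.puts '</doc>'
end

# Writing to an in-memory buffer for a quick look at the output.
buffer = StringIO.new
write_products(buffer, 2, 'Fish & "Chips" <large>')
puts buffer.string
```

For the real file, the same helper could be called as `File.open('foo.xml', 'w') { |f| write_products(f, 500_000, 'Awesome widget') }`.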