How to use SAX with Nokogiri?


I want to parse a very large file (240 MB) and have to use SAX to avoid loading the whole file into memory.

My XML looks like:

<?xml version="1.0" encoding="utf-8"?>
<hotels>
  <hotel>
    <hotelId>1568054</hotelId>
    <hotelFileName>Der_Obere_Wirt_zum_Queri</hotelFileName>
    <hotelName>"Der Obere Wirt" zum Queri</hotelName>
    <rating>3</rating>
    <cityId>34633</cityId>
    <cityFileName>Andechs</cityFileName>
    <cityName>Andechs</cityName>
    <stateId>212</stateId>
    <stateFileName>Bavaria</stateFileName>
    <stateName>Bavaria</stateName>
    <countryCode>DE</countryCode>
    <countryFileName>Germany</countryFileName>
    <countryName>Germany</countryName>
    <imageId>51498149</imageId>
    <Address>Georg Queri Ring 9</Address>
    <minRate>85.9800</minRate>
    <currencyCode>EUR</currencyCode>
    <Latitude>48.009423000000</Latitude>
    <Longitude>11.214504000000</Longitude>
    <NumberOfReviews>16</NumberOfReviews>
    <ConsumerRating>4.25</ConsumerRating>
    <PropertyType>0</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>1|3|5|8|22|27|45|49|53|56|64|66|67|139|202|209|213|256|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1658359</hotelId>
    <hotelFileName>Seclusions_of_Yallingup</hotelFileName>
    <hotelName>"Seclusions" of Yallingup</hotelName>
    <rating>4</rating>
    <cityId>72257</cityId>
    <cityFileName>Yallingup</cityFileName>
    <cityName>Yallingup</cityName>
    <stateId>172</stateId>
    <stateFileName>Western_Australia</stateFileName>
    <stateName>Western Australia</stateName>
    <countryCode>AU</countryCode>
    <countryFileName>Australia</countryFileName>
    <countryName>Australia</countryName>
    <imageId>53234107</imageId>
    <Address>58 Zamia Grove</Address>
    <minRate>218.1825</minRate>
    <currencyCode>AUD</currencyCode>
    <Latitude>-33.691192000000</Latitude>
    <Longitude>115.061938999999</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>3</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>3|6|13|14|21|22|28|39|40|41|51|53|54|56|57|58|65|66|141|191|202|204|209|210|211|292|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1491947</hotelId>
    <hotelFileName>1_Melrose_Blvd</hotelFileName>
    <hotelName>#1 Melrose Blvd</hotelName>
    <rating>5</rating>
    <cityId>964</cityId>
    <cityFileName>Johannesburg</cityFileName>
    <cityName>Johannesburg</cityName>
    <stateId/>
    <stateFileName/>
    <stateName/>
    <countryCode>ZA</countryCode>
    <countryFileName>South_Africa</countryFileName>
    <countryName>South Africa</countryName>
    <imageId>46777171</imageId>
    <Address>1 Melrose Boulevard Melrose Arch</Address>
    <minRate/>
    <currencyCode>ZAR</currencyCode>
    <Latitude>-26.135656000000</Latitude>
    <Longitude>28.067751000000</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>9</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>6|7|9|11|12|15|17|18|21|32|34|39|41|42|50|51|56|58|60|140|173|202|293|296|</Facilities>
  </hotel>
  <hotel>
    <hotelId>1726938</hotelId>
    <hotelFileName>1_Value_Inn_Clovis</hotelFileName>
    <hotelName>#1 Value Inn Clovis</hotelName>
    <rating>2</rating>
    <cityId>28538</cityId>
    <cityFileName>Clovis_New_Mexico</cityFileName>
    <cityName>Clovis (New Mexico)</cityName>
    <stateId>32</stateId>
    <stateFileName>New_Mexico</stateFileName>
    <stateName>New Mexico</stateName>
    <countryCode>US</countryCode>
    <countryFileName>United_States</countryFileName>
    <countryName>United States</countryName>
    <imageId/>
    <Address>1720 Mabry</Address>
    <minRate/>
    <currencyCode>USD</currencyCode>
    <Latitude>34.396549224853</Latitude>
    <Longitude>-103.182769775390</Longitude>
    <NumberOfReviews>0</NumberOfReviews>
    <ConsumerRating>0</ConsumerRating>
    <PropertyType>2</PropertyType>
    <ChainID>0</ChainID>
    <Facilities>6|7|8|18|21|22|27|41|50|52|56|222|281|292|</Facilities>
  </hotel>
</hotels>

I tried this code:

require 'nokogiri'

class Wikihandler < Nokogiri::XML::SAX::Document

  def initialize
    # do one-time setup here, called as part of Class.new
  end

  def start_element(name, attributes = [])
    # check the element name here and create an active record object if appropriate
    if name == 'hotel'
      a = attributes.to_h # attributes arrive as an array of [name, value] pairs
      puts attributes
      # more business...
    end
  end

  def characters(s)
    # save the characters that appear here and possibly use them in the current tag object
  end

  def end_element(name)
    # check the tag name and possibly use the characters you've collected
    # and save your activerecord object now
  end

end

parser = Nokogiri::XML::SAX::Parser.new(Wikihandler.new)
parser.parse_file('HotelCombinedXml/Hotels_All.xml')

I can access the name of the tag, but how can I access its content?


Solution

  • The text content of an element is passed to the handler's characters callback (Wikihandler#characters in your code). You could do something like:

    require "nokogiri"
    
    class MyDocument < Nokogiri::XML::SAX::Document
      attr_accessor :is_name
    
      def initialize
        @is_name = false
      end
    
      def end_document
        puts "the document has ended"
      end
    
      # Remember whether the element that just opened is <hotelName>.
      def start_element name, attributes = []
        @is_name = name.eql?("hotelName")
      end
    
      # Receives the text between tags; whitespace-only chunks are skipped.
      def characters string
        text = string.strip
        if @is_name && !text.empty?
          puts "Name: #{text}"
        end
      end
    end
    

    However, if you want to make your life easier, I'd suggest checking out sax-machine. It adds some nice functionality and (IMHO) a friendlier interface to Nokogiri's SAX parser. Here is some sample code and specs:

    require "sax-machine"
    require "rspec"
    
    XML = <<XML
    <?xml version="1.0" encoding="utf-8"?>
    <hotels>
      <hotel>
        <hotelId>1568054</hotelId>
        <hotelFileName>Der_Obere_Wirt_zum_Queri</hotelFileName>
        <hotelName>"Der Obere Wirt" zum Queri</hotelName>
        <rating>3</rating>
        <cityId>34633</cityId>
        <cityFileName>Andechs</cityFileName>
        <cityName>Andechs</cityName>
        <stateId>212</stateId>
        <stateFileName>Bavaria</stateFileName>
        <stateName>Bavaria</stateName>
        <countryCode>DE</countryCode>
        <countryFileName>Germany</countryFileName>
        <countryName>Germany</countryName>
        <imageId>51498149</imageId>
        <Address>Georg Queri Ring 9</Address>
        <minRate>85.9800</minRate>
        <currencyCode>EUR</currencyCode>
        <Latitude>48.009423000000</Latitude>
        <Longitude>11.214504000000</Longitude>
        <NumberOfReviews>16</NumberOfReviews>
        <ConsumerRating>4.25</ConsumerRating>
        <PropertyType>0</PropertyType>
        <ChainID>0</ChainID>
        <Facilities>1|3|5|8|22|27|45|49|53|56|64|66|67|139|202|209|213|256|</Facilities>
      </hotel>
      <hotel>
        <hotelId>1658359</hotelId>
        <hotelFileName>Seclusions_of_Yallingup</hotelFileName>
        <hotelName>"Seclusions" of Yallingup</hotelName>
        <rating>4</rating>
        <cityId>72257</cityId>
        <cityFileName>Yallingup</cityFileName>
        <cityName>Yallingup</cityName>
        <stateId>172</stateId>
        <stateFileName>Western_Australia</stateFileName>
        <stateName>Western Australia</stateName>
        <countryCode>AU</countryCode>
        <countryFileName>Australia</countryFileName>
        <countryName>Australia</countryName>
        <imageId>53234107</imageId>
        <Address>58 Zamia Grove</Address>
        <minRate>218.1825</minRate>
        <currencyCode>AUD</currencyCode>
        <Latitude>-33.691192000000</Latitude>
        <Longitude>115.061938999999</Longitude>
        <NumberOfReviews>0</NumberOfReviews>
        <ConsumerRating>0</ConsumerRating>
        <PropertyType>3</PropertyType>
        <ChainID>0</ChainID>
        <Facilities>3|6|13|14|21|22|28|39|40|41|51|53|54|56|57|58|65|66|141|191|202|204|209|210|211|292|</Facilities>
      </hotel>
      <hotel>
        <hotelId>1491947</hotelId>
        <hotelFileName>1_Melrose_Blvd</hotelFileName>
        <hotelName>#1 Melrose Blvd</hotelName>
        <rating>5</rating>
        <cityId>964</cityId>
        <cityFileName>Johannesburg</cityFileName>
        <cityName>Johannesburg</cityName>
        <stateId/>
        <stateFileName/>
        <stateName/>
        <countryCode>ZA</countryCode>
        <countryFileName>South_Africa</countryFileName>
        <countryName>South Africa</countryName>
        <imageId>46777171</imageId>
        <Address>1 Melrose Boulevard Melrose Arch</Address>
        <minRate/>
        <currencyCode>ZAR</currencyCode>
        <Latitude>-26.135656000000</Latitude>
        <Longitude>28.067751000000</Longitude>
        <NumberOfReviews>0</NumberOfReviews>
        <ConsumerRating>0</ConsumerRating>
        <PropertyType>9</PropertyType>
        <ChainID>0</ChainID>
        <Facilities>6|7|9|11|12|15|17|18|21|32|34|39|41|42|50|51|56|58|60|140|173|202|293|296|</Facilities>
      </hotel>
      <hotel>
        <hotelId>1726938</hotelId>
        <hotelFileName>1_Value_Inn_Clovis</hotelFileName>
        <hotelName>#1 Value Inn Clovis</hotelName>
        <rating>2</rating>
        <cityId>28538</cityId>
        <cityFileName>Clovis_New_Mexico</cityFileName>
        <cityName>Clovis (New Mexico)</cityName>
        <stateId>32</stateId>
        <stateFileName>New_Mexico</stateFileName>
        <stateName>New Mexico</stateName>
        <countryCode>US</countryCode>
        <countryFileName>United_States</countryFileName>
        <countryName>United States</countryName>
        <imageId/>
        <Address>1720 Mabry</Address>
        <minRate/>
        <currencyCode>USD</currencyCode>
        <Latitude>34.396549224853</Latitude>
        <Longitude>-103.182769775390</Longitude>
        <NumberOfReviews>0</NumberOfReviews>
        <ConsumerRating>0</ConsumerRating>
        <PropertyType>2</PropertyType>
        <ChainID>0</ChainID>
        <Facilities>6|7|8|18|21|22|27|41|50|52|56|222|281|292|</Facilities>
      </hotel>
    </hotels>
    XML
    
    class Hotel
      include SAXMachine
      element :hotelId, :as => :id
      element :hotelName, :as => :name
    end
    
    class Wikihandler
      include SAXMachine
      elements :hotel, :as => :hotels, :class => Hotel
    end
    
    describe Wikihandler do
      before(:all) do
        @parser = Wikihandler.new
        @parser.parse XML
      end
    
      it "should parse the proper number of hotels" do
        @parser.hotels.count.should eq 4
      end
    
      it "should parse the hotel id of each entry" do
        @parser.hotels[0].id.should eq "1568054"
      end
    
      it "should parse the hotel name of each entry" do
        @parser.hotels[0].name.should eq '"Der Obere Wirt" zum Queri'
      end
    end