Tags: ruby, xpath, web-crawler, nokogiri, rexml

How to crawl the right way?


I have been working and tinkering with Nokogiri, REXML & Ruby for a month. I have this giant database that I am trying to crawl. The things that I am scraping are HTML links and XML files.

There are exactly 43612 XML files that I want to crawl and store in a CSV file.

My script works if I crawl maybe 500 XML files, but anything larger takes too much time and the script seems to freeze.

I have divided the code into pieces here so it is easier to read; the whole script is here: https://gist.github.com/1981074

I am using two libraries because I couldn't find a way to do all of this in Nokogiri. I personally find REXML easier to use.

My question: How can I fix it so it won't take me a week to crawl all this? How do I make it run faster?

HERE IS MY SCRIPT:

Require the necessary libraries:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML

Create a bunch of arrays to store the grabbed data:

@urls = Array.new 
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new

Grab all the XML links from a specific site and store them in an array called @urls:

htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))

htmldoc.xpath('//a/@href').each do |links|
  @urls << links.content
end

Loop through the @urls array and grab every element node that I want with XPath:

@urls.each do |url|
  # Load and parse one XML file
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Fetch the info id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
    m = e.text.to_s
    next if m.empty?
    @titleSv << m
  end
  # ... the remaining fields are omitted here; see the gist
end

Then store them in a CSV file.

CSV.open("eduction_normal.csv", "wb") do |row|
  (0..@ID.length - 1).each do |index|
    row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
  end
end

Solution

  • It's hard to pinpoint the exact problem because of the way the code is structured. Here are a few suggestions to increase the speed and structure the program so that it will be easier to find what's blocking you.

    Libraries

    You're using a lot of libraries here that probably aren't necessary.

    You use both REXML and Nokogiri. They do the same job, except that Nokogiri is much faster at it (benchmark).

    Use Hashes

    Instead of storing data at the same index in 15 separate arrays, keep one set of hashes, with one hash per XML file.

    For instance,

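    require 'set'

    items = Set.new

    # doc is the parsed index page (htmldoc in the question)
    doc.xpath('//a/@href').each do |url|
      item = {}
      item[:url] = url.content
      items << item
    end

    items.each do |item|
      # URI.open is provided by open-uri
      xml = Nokogiri::XML(URI.open(item[:url]))

      item[:id] = xml.root['id']
      ...
    end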
    

    Collect the data, then write to file

    Now that you have your items set, you can iterate over it and write everything to the file in one pass. That is usually faster than writing while you crawl, and it keeps the crawl loop simple.
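
    A minimal sketch of that write step, assuming each item is a hash keyed by field name. The sample items, the column list, the write_items helper, and the output path are all hypothetical names chosen for illustration.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical items standing in for the crawled data.
items = [
  { id: 'info.1', title_sv: 'Matematik', title_en: 'Mathematics' },
  { id: 'info.2', title_sv: 'Fysik' } # no English title
]

# Column order for the output; extend with the remaining fields.
columns = [:id, :title_sv, :title_en]

# Write one CSV row per item; a missing key becomes an empty cell.
def write_items(path, items, columns)
  CSV.open(path, 'wb') do |csv|
    items.each do |item|
      csv << columns.map { |column| item[column] }
    end
  end
end

out_path = File.join(Dir.tmpdir, 'education_sample.csv')
write_items(out_path, items, columns)
```

    Keeping the column order in one list means a new field only has to be added there, rather than in a long row literal.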

    Be DRY

    In your original code, you have the same thing repeated a dozen times. Instead of copying and pasting, abstract out the common code.

    xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]"){
        |e| m = e.text 
         m = m.to_s
         next if m.empty? 
         @titleSv << m
    }
    

    Move what's common to a method

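    # Return the text of the first matching element that is not empty,
    # or '' when nothing matches. xml is a REXML::Document here.
    def get_value(xml, path)
      xml.elements.each(path) do |e|
        str = e.text.to_s
        return str unless str.empty?
      end

      ''
    end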
    

    And move anything constant to another hash

    xml_paths = {
      :title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
      :title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
      ...
    }
    

    Now you can combine these techniques to make for much cleaner code

    item[:title_sv] = get_value(xml, xml_paths[:title_sv])
    item[:title_en] = get_value(xml, xml_paths[:title_en])
    

    I hope this helps!