I have been working and tinkering with Nokogiri, REXML & Ruby for a month. I have this giant database that I am trying to crawl. The things that I am scraping are HTML links and XML files.
There are exactly 43612 XML files that I want to crawl and store in a CSV file.
My script works if I crawl maybe 500 XML files, but anything larger takes too much time, and it freezes or something.
I have divided the code in pieces here so it would be easy to read, the whole script/code is here: https://gist.github.com/1981074
I am using two libraries because I couldn't find a way to do this all in Nokogiri. I personally find REXML easier to use.
My question: How can I fix it so it won't take a week for me to crawl all of this? How do I make it run faster?
HERE IS MY SCRIPT:
Require the necessary libraries:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'rexml/document'
require 'csv'
include REXML
Create a bunch of arrays to store the grabbed data:
@urls = Array.new
@ID = Array.new
@titleSv = Array.new
@titleEn = Array.new
@identifier = Array.new
@typeOfLevel = Array.new
Grab all the XML links from a specific site and store them in an array called @urls:
htmldoc = Nokogiri::HTML(open('http://testnavet.skolverket.se/SusaNavExport/EmilExporter?GetEvent&EMILVersion=1.1&NotExpired&EEFormOfStudy=normal&EIAcademicType=UoH&SelectEI'))
htmldoc.xpath('//a/@href').each do |link|
  @urls << link.content
end
Loop through the @urls array, and grab every element node that I want with XPath:
@urls.each do |url|
  # Loop through the XML files and grab element nodes
  xmldoc = REXML::Document.new(open(url).read)
  # Root element
  root = xmldoc.root
  # Grab the info-id
  @ID << root.attributes["id"]
  # TitleSv
  xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]") do |e|
    m = e.text.to_s
    next if m.empty?
    @titleSv << m
  end
  # ... the remaining fields (@titleEn, @identifier, @typeOfLevel, ...) are grabbed the same way
end
Then store them in a CSV file.
CSV.open("eduction_normal.csv", "wb") do |row|
(0..@ID.length - 1).each do |index|
row << [@ID[index], @titleSv[index], @titleEn[index], @identifier[index], @typeOfLevel[index], @typeOfResponsibleBody[index], @courseTyp[index], @credits[index], @degree[index], @preAcademic[index], @subjectCodeVhs[index], @descriptionSv[index], @lastedited[index], @expires[index]]
end
end
It's hard to pinpoint the exact problem because of the way the code is structured. Here are a few suggestions to increase the speed and structure the program so that it will be easier to find what's blocking you.
You're using a lot of libraries here that probably aren't necessary.
You use both REXML and Nokogiri. They both do the same job, except Nokogiri is much better at it (benchmark).
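For comparison, here is a minimal sketch of one fetch-and-parse step done with Nokogiri alone. Here url stands for one of the XML links, and remove_namespaces! is my assumption for letting one plain XPath match the ns:-prefixed files as well:
# Sketch: Nokogiri-only parsing; `url` is one of the XML links from above
xml = Nokogiri::XML(open(url))
# Assumption: stripping namespaces lets one plain XPath match both document styles
xml.remove_namespaces!
id       = xml.root['id']   # same as root.attributes["id"] in REXML
title_sv = xml.at_xpath('/educationInfo/titles/title[1]/text()').to_s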
Instead of storing data at an index in 15 arrays, have one set of hashes.
For instance,
require 'set'

items = Set.new

doc.xpath('//a/@href').each do |url|
  item = {}
  item[:url] = url.content
  items << item
end

items.each do |item|
  xml = Nokogiri::XML(open(item[:url]))
  item[:id] = xml.root['id']
  ...
end
Now that you have your items set, you can iterate over it and write to the file. This is much faster than doing it line by line.
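For example, a minimal sketch that writes the whole set in one pass. It assumes every item hash ends up with the same keys; only a few of the fields are shown:
# Sketch: one pass over items; assumes each hash carries the same keys
CSV.open('eduction_normal.csv', 'wb') do |csv|
  items.each do |item|
    csv << [item[:id], item[:title_sv], item[:title_en]]
  end
end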
In your original code, you have the same thing repeated a dozen times. Instead of copying and pasting, try instead to abstract out the common code.
xmldoc.elements.each("/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]"){
|e| m = e.text
m = m.to_s
next if m.empty?
@titleSv << m
}
Move what's common into a method:
def get_value(xml, path)
  str = ''
  xml.elements.each(path) do |e|
    text = e.text.to_s
    next if text.empty?   # skip empty matches so they don't clobber str
    str = text
  end
  str
end
And move anything constant into another hash:
xml_paths = {
  :title_sv => "/educationInfo/titles/title[1] | /ns:educationInfo/ns:titles/ns:title[1]",
  :title_en => "/educationInfo/titles/title[2] | /ns:educationInfo/ns:titles/ns:title[2]",
  ...
}
Now you can combine these techniques to write much cleaner code:
item[:title_sv] = get_value(xml, xml_paths[:title_sv])
item[:title_en] = get_value(xml, xml_paths[:title_en])
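And since the paths all live in one hash, you could even fill in every field with a single loop:
xml_paths.each do |key, path|
  item[key] = get_value(xml, path)
end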
I hope this helps!