Tags: ruby, web-scraping, nokogiri, mechanize, open-uri

Ruby madness: downloading the same file with Nokogiri, Mechanize and OpenURI to get different information


Ok,

I am writing the ubiquitous crawler and have run into some problems. Not surprising, being a total noob at Ruby.

I use Nokogiri to get the html of a page - find all the links in it that I am interested in and then download the files associated with those links. All is well so far.

However, I don't seem to be able to get the information I need from a single method.

If I use file = open(Src).read then file contains the contents of the file - which is great for saving to a database, and for hashing purposes. But it doesn't give me easy access (as far as I have found) to attributes such as filename, size, file type etc.

To get that information I am using Mechanize like this:

agent = Mechanize.new
fop = agent.get(Src)

Using the agent.head method I can get the content-type, last-modified date and content-length, and fop.filename gives me the filename, of course. However, I think each agent.head(Src)["content-type"] call re-downloads the headers, so for the content-type, last-modified and content-length calls it is downloading the head three times. That is a total waste, I would say, as file already contains the complete file and fop should provide me with all the other information I need without calling head.

So is there a better way of doing this? (The code below is from the thumbnail downloader.)

thumbs.each do |thumb|
  imgSrc = thumb.css('.t_img').first['src']
  file = open(imgSrc).read 
  agent = Mechanize.new
  fop = agent.get(imgSrc)
  p fop
  puts "1 Driver        : prowl.rb" 
  puts "1 Source        : " + pageURL
  puts "1 Title         : " + thumb.css('.t_img').first['alt']
  puts "1 File Source   : " + imgSrc
  puts "1 File Type     : " + agent.head(imgSrc)["content-type"].to_s
  puts "1 File Name     : " + fop.filename
  puts "1 Last Modified : " + agent.head(imgSrc)["last-modified"].to_s
  puts "1 Image Size    : " + agent.head(imgSrc)["content-length"].to_s
  puts "1 MD5           : " + GetMD5(*[file.to_s])
  puts "1 SHA256        : " + GetSha256(*[file.to_s])
end 

So the question is:

  1. How can I optimise my crawler so that I can get all the information I want with the minimum number of requests?

Solution

    # Create the agent once and reuse it for every thumbnail.
    agent = Mechanize.new
    thumbs.each do |thumb|
      imgUrl = thumb.css('.t_img').first['src']
      imgTitle = thumb.css('.t_img').first['alt']
      # One GET returns both the body and the response headers.
      image = agent.get(imgUrl)
      p image
      puts "1 Driver        : prowl.rb"
      puts "1 Source        : " + pageURL
      puts "1 Title         : " + imgTitle
      puts "1 File Source   : " + imgUrl
      # .to_s guards against a header the server did not send (nil).
      puts "1 File Type     : " + image.header['content-type'].to_s
      puts "1 File Name     : " + image.filename
      puts "1 Last Modified : " + image.header["last-modified"].to_s
      puts "1 Image Size    : " + image.header["content-length"].to_s
      puts "1 MD5           : " + GetMD5(image.content.to_s)
      puts "1 SHA256        : " + GetSha256(image.content.to_s)
    end
    

    Here it is. Reuse the agent; there is no point in creating a new one every time.

    Get the file directly from Mechanize; there is no need to open and read it separately and then pass the content around. All the header information you are looking for is in the header attribute of the object that agent.get returns.
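
If it helps, the same "fetch once, read everything from that one response" idea can be sketched as a small helper that builds a single record per file. This is only a rough sketch under a few assumptions: file_metadata and FakeResponse are hypothetical names that do not appear in the question, FakeResponse is a stand-in for whatever agent.get returns so that the snippet runs without a network connection, and Digest::MD5 / Digest::SHA256 from the standard library are used in place of the GetMD5 / GetSha256 helpers, whose definitions are not shown.

```ruby
require 'digest'

# Hypothetical stand-in for the object returned by agent.get. It only needs
# to respond to header, filename and body for this sketch to work.
FakeResponse = Struct.new(:header, :filename, :body)

# Build one metadata record from a single, already-fetched response,
# so no extra HEAD or GET requests are needed.
def file_metadata(response, source_url)
  body = response.body.to_s
  {
    source:        source_url,
    file_name:     response.filename,
    file_type:     response.header['content-type'].to_s,
    last_modified: response.header['last-modified'].to_s,
    # Fall back to the body length if the server sent no Content-Length.
    size:          (response.header['content-length'] || body.bytesize).to_s,
    md5:           Digest::MD5.hexdigest(body),
    sha256:        Digest::SHA256.hexdigest(body)
  }
end

# Made-up sample data, purely for illustration.
response = FakeResponse.new(
  { 'content-type' => 'image/png',
    'last-modified' => 'Wed, 01 Jan 2020 00:00:00 GMT' },
  'thumb.png',
  'hello'
)

info = file_metadata(response, 'http://example.com/thumb.png')
info.each { |key, value| puts "#{key.to_s.ljust(14)}: #{value}" }
```

Inside the loop above, something like file_metadata(agent.get(imgUrl), imgUrl) should then give the same record from a real download, assuming the returned object responds to header, filename and body.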