Ok,
I am writing the ubiquitous crawler and have run into some problems. Not surprising, being a total noob at Ruby.
I use Nokogiri to get the HTML of a page, find all the links in it that I am interested in, and then download the files associated with those links. All is well so far.
However, I don't seem to be able to get the information I need from a single method.
If I use
file = open(Src).read
then file contains the contents of the file, which is great for saving to a database and for hashing purposes. But it doesn't give me easy access (as far as I have found) to attributes such as filename, size, file type etc.
To get that information I am using Mechanize like this:
agent = Mechanize.new
fop = agent.get(Src)
Using the agent.head method I can get the content-type, last-modified date, and content-length, and
fop.filename
gives me the filename of course. Now, using the
agent.head(Src)["content-type"]
method is, I think, re-downloading the information, so for the content-type, last-modified and content-length calls it is downloading the head 3 times. A total waste I would say, as file already contains the complete file and fop should provide me with all the other information I need without calling head.
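The repeated downloads could at least be cut to one by keeping the head result in a variable and reading all three fields from it. A minimal sketch, assuming the object returned by agent.head answers [] with header names, as in the calls above; header_summary is a made-up helper name, not part of Mechanize:

```ruby
# Hypothetical helper: read the three header fields from ONE response.
# `response` is anything that answers [] with a header name, for example
# the object returned by a single agent.head(url) call, or a plain Hash.
def header_summary(response)
  {
    content_type:   response["content-type"].to_s,   # "" when the header is absent
    last_modified:  response["last-modified"].to_s,
    content_length: response["content-length"].to_s
  }
end

# Intended use with Mechanize (one HEAD request instead of three):
#   head = agent.head(imgSrc)
#   info = header_summary(head)
#   puts "1 File Type : " + info[:content_type]
```

This still costs one extra request on top of the download itself, so it only reduces the waste rather than removing it.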
So is there a better way of doing this? (This is from the thumbnail downloader.)
thumbs.each do |thumb|
  imgSrc = thumb.css('.t_img').first['src']
  file = open(imgSrc).read
  agent = Mechanize.new
  fop = agent.get(imgSrc)
  p fop
  puts "1 Driver : prowl.rb"
  puts "1 Source : " + pageURL
  puts "1 Title : " + thumb.css('.t_img').first['alt']
  puts "1 File Source : " + imgSrc
  puts "1 File Type : " + agent.head(imgSrc)["content-type"].to_s
  puts "1 File Name : " + fop.filename
  puts "1 Last Modified : " + agent.head(imgSrc)["last-modified"].to_s
  puts "1 Image Size : " + agent.head(imgSrc)["content-length"].to_s
  puts "1 MD5 : " + GetMD5(*[file.to_s])
  puts "1 SHA256 : " + GetSha256(*[file.to_s])
end
So the question is: how can I get the file contents and all of this information from a single request?
# Create the agent once and reuse it for every thumbnail.
agent = Mechanize.new
thumbs.each do |thumb|
  imgUrl = thumb.css('.t_img').first['src']
  imgTitle = thumb.css('.t_img').first['alt']
  # A single GET returns both the file body and the response headers.
  image = agent.get(imgUrl)
  p image
  puts "1 Driver : prowl.rb"
  puts "1 Source : " + pageURL
  puts "1 Title : " + imgTitle
  puts "1 File Source : " + imgUrl
  # .to_s guards against a missing header (nil cannot be concatenated).
  puts "1 File Type : " + image.header['content-type'].to_s
  puts "1 File Name : " + image.filename
  puts "1 Last Modified : " + image.header["last-modified"].to_s
  puts "1 Image Size : " + image.header["content-length"].to_s
  # image.body is the downloaded content as a string.
  puts "1 MD5 : " + GetMD5(image.body)
  puts "1 SHA256 : " + GetSha256(image.body)
end
Here it is. Reuse the agent; there is no point in creating a new one every time.
Get the page directly from Mechanize; there is no need to open and read the URL separately and then pass the content around. All the header information you are looking for is in the header attribute of your page.
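The GetMD5 and GetSha256 helpers are called in the question but never shown. In case it helps, here is a minimal sketch of what they might look like using Ruby's standard Digest library; the snake_case names get_md5 and get_sha256 are my own stand-ins, and the real helpers may well differ:

```ruby
require "digest"

# Hypothetical stand-in for GetMD5: hex-encoded MD5 of a string.
def get_md5(data)
  Digest::MD5.hexdigest(data)
end

# Hypothetical stand-in for GetSha256: hex-encoded SHA-256 of a string.
def get_sha256(data)
  Digest::SHA256.hexdigest(data)
end

# Intended use with the page fetched above:
#   puts "1 MD5 : " + get_md5(image.body)
#   puts "1 SHA256 : " + get_sha256(image.body)
```

Hashing the body of the page you already fetched means the file is downloaded once and used for saving, hashing and the header fields alike.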