ruby web-crawler web-scraping nokogiri popen

Sinew (ruby web scraper) example does not work on my machine


I'm trying to run the sample from the Sinew source code, but it fails on my machine. Here is the sample (taken directly from their GitHub repository):

get "http://www.amazon.com/gp/bestsellers/books/ref=sv_b_3"
noko.css(".zg_itemRow").each do |item|
  row = { }
  row[:url] = item.css(".zg_title a").first[:href]
  row[:title] = item.css(".zg_title")
  row[:img] = item.css(".zg_itemImage_normal img").first[:src]
  csv_emit(row)
end

I'm using Ubuntu 12.04 with Ruby 1.9.3 and RVM. Here is the command I ran, followed by the error:

jefferton@ubuntu:~/IdeaProjects/sinew_scrape$ sinew sell_list.sinew
curl http://www.amazon.com/gp/bestsellers/books/ref=sv_b_3
/home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/text_util.rb:48:in `popen': No such file or directory - tidy -asxml  -bare  -quiet  -utf8  -wrap 0 --doctype omit --hide-comments yes --force-output yes -f /dev/null (Errno::ENOENT)
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/text_util.rb:48:in `html_tidy'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/main.rb:33:in `html'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/main.rb:59:in `noko'
from sell_list.sinew:9:in `_run'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/main.rb:121:in `instance_eval'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/main.rb:121:in `_run'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/lib/sinew/main.rb:16:in `initialize'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/bin/sinew:19:in `new'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/bin/sinew:19:in `block in <top (required)>'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/bin/sinew:18:in `each'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/gems/sinew-1.0.2/bin/sinew:18:in `<top (required)>'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/bin/sinew:19:in `load'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/bin/sinew:19:in `<main>'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/bin/ruby_noexec_wrapper:14:in `eval'
from /home/jefferton/.rvm/gems/ruby-1.9.3-head/bin/ruby_noexec_wrapper:14:in `<main>'

I wish I had a more specific question to ask, but I'm not sure what to do here.

Thanks.


Solution

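  • The immediate failure is in the first line of the trace: popen raised Errno::ENOENT because it could not find the tidy executable (HTML Tidy) that Sinew shells out to, so installing tidy would probably get the sample running. That library might be worth looking into, but I can't imagine why they would use curl over mechanize, or what HTML Tidy is supposed to be for. Shelling out to external executables like that is a fragile approach, since the script breaks whenever one of them is missing. My advice is to avoid it and use mechanize instead.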
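
    To confirm that a missing executable is what triggers the Errno::ENOENT above, here is a minimal sketch that looks for tidy on the PATH using only Ruby's standard library. The helper name find_executable is made up for this illustration and is not part of Sinew.

```ruby
# Hypothetical helper (not part of Sinew): return the full path of an
# executable found in one of the PATH directories, or nil if it is absent.
def find_executable(name, path = ENV.fetch("PATH", ""))
  path.split(File::PATH_SEPARATOR).each do |dir|
    next if dir.empty?
    candidate = File.join(dir, name)
    # Must be a regular file that the current user is allowed to execute.
    return candidate if File.file?(candidate) && File.executable?(candidate)
  end
  nil
end

tidy = find_executable("tidy")
if tidy
  puts "tidy found at #{tidy}"
else
  puts "tidy is not on PATH, so popen would fail with Errno::ENOENT"
end
```

    If the check reports that tidy is missing, installing it should clear this particular error. On Ubuntu it is typically available as the tidy package (sudo apt-get install tidy), though that is an assumption about your setup and not something the Sinew sample states.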