
Problems in a Ruby screen-scraping script

I have a small crawler/screen-scraping script that worked half a year ago but no longer does. I checked the HTML and CSS values that the regular expression relies on in the page source, and they are still the same, so from that point of view it should work. Any guesses?

require "open-uri"

# output file
f = File.open("results.csv", "w+")

# output string
results = ""

# crawl first 20 pages
for i in (1..20)
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
  URI.open("http://www.example-#{i}.com") { |url|

    # check each line using regular expression
    url.each_line { |line|
      if line =~ /class=\"L1g\" onclick=\"s_objectID=\'foobar\'\">([^<]+)<\/a><\/h3><\/li>/
        # if regular expression matches then add to results
        results += $1 + "\n"
      end
    }
  }
end

# write to and close file
f.print results
f.close


  • The target website appears to have changed the structure of its page, so your regex no longer matches.

    This is a good example of why you should not scrape pages by matching content with a regex. Try reworking your script to use an HTML parser such as Nokogiri. This will not necessarily stop your script from breaking, but it will at least let it survive minor changes to the markup.

    The reason it is not working can be seen in this Rubular link