Playing with Scrapi in Rails 3.. getting Segmentation Fault error / Abort Trap


What I've done so far:

sudo gem install scrapi

sudo gem install tidy

This didn't work because libtidy.dylib was missing.

So I did this:

sudo port install tidy

sudo cp libtidy.dylib /Library/Ruby/Gems/1.8/gems/scrapi-1.2.0/lib/tidy/libtidy.dylib
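
A quick way to confirm that the copy actually landed at the destination is a small check like the one below. This is only a sketch: `libtidy_present?` is a hypothetical helper, not part of scrapi or tidy, and the path is simply the destination from the cp command above.

```ruby
# Hypothetical helper: true when a non-empty regular file exists at path.
def libtidy_present?(path)
  File.file?(path) && File.size(path) > 0
end

# Destination path taken from the cp command above.
LIBTIDY_PATH = "/Library/Ruby/Gems/1.8/gems/scrapi-1.2.0/lib/tidy/libtidy.dylib"

if libtidy_present?(LIBTIDY_PATH)
  puts "Found #{LIBTIDY_PATH}"
else
  puts "Missing or empty: #{LIBTIDY_PATH}"
end
```

This only shows that a file is there; it says nothing about whether the library matches what the tidy gem expects.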

Then I started following the Railscast at: http://media.railscasts.com/videos/173_screen_scraping_with_scrapi.mov

Right after Mr. Bates finished the first save of scrapitest.rb, I tried to run this code:

require 'rubygems'
require 'scrapi'

scraper = Scraper.define do
  process "title", :page_name => :text
  result :page_name
end

uri = URI.parse("http://www.walmart.com/search/search-ng.do?search_query=lost+season+3&ic=48_0&search_constraint=0")
p scraper.scrape(uri)

With this command:

ruby scrapitest.rb

And it returned this error:

/Library/Ruby/Gems/1.8/gems/tidy-1.1.2/lib/tidy/tidybuf.rb:39: [BUG] Segmentation fault
ruby 1.8.7 (2009-06-12 patchlevel 174) [universal-darwin10.0]

Abort trap

Completely out of ideas..


Solution

  • I had this issue and then a follow-up issue where a seg fault would happen non-deterministically.

    I followed the steps here - http://rubyforge.org/tracker/index.php?func=detail&aid=10007&group_id=435&atid=1744

    In tidy-1.1.2/lib/tidy/tidylib.rb:

    1. Add this line to the 'load' method in Tidylib:
    
      extern "void tidyBufInit(void*)"
    
    2. Define a new method called 'buf_init' in Tidylib:
    
      # tidyBufInit, using default allocator
      #
      def buf_init(buf)
        tidyBufInit(buf)
      end
    

    Then, in tidy-1.1.2/lib/tidy/tidybuf.rb:

    3. Add this line to the initialize method of Tidybuf below the malloc:
    
       Tidylib.buf_init(@struct)
    

    so that it looks like this:

    
      # Allocate the TidyBuffer, then initialize it via tidyBufInit
      #
      def initialize
        @struct = TidyBuffer.malloc
        Tidylib.buf_init(@struct)
      end
    
    4. For completeness, make Brennan's change by adding the allocator field to the TidyBuffer struct so that it looks like this:
    
      TidyBuffer = struct [
        "TidyAllocator* allocator",
        "byte* bp",
        "uint size",
        "uint allocated",
        "uint next"
      ]
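
    To see why the extra field in step 4 matters, here is a rough sketch that approximates both layouts with Array#pack. It assumes the libtidy being loaded puts the allocator pointer first, as step 4 implies. The constant names are made up for illustration, and pack does not add the trailing padding a C compiler might, so the totals are lower bounds.

```ruby
# Illustrative only: approximate the two TidyBuffer layouts with Array#pack.
# 'p' packs a native pointer (nil becomes NULL); 'I' packs a native unsigned int.
POINTER_SIZE = [nil].pack('p').bytesize

# Layout declared by the unpatched gem: bp, size, allocated, next.
OLD_LAYOUT = [nil, 0, 0, 0].pack('pIII')

# Layout after step 4: allocator, bp, size, allocated, next.
NEW_LAYOUT = [nil, nil, 0, 0, 0].pack('ppIII')

# Byte offset of the "size" field in each layout.
OLD_SIZE_OFFSET = POINTER_SIZE       # after bp
NEW_SIZE_OFFSET = POINTER_SIZE * 2   # after allocator and bp

puts "old layout: #{OLD_LAYOUT.bytesize} bytes, size field at offset #{OLD_SIZE_OFFSET}"
puts "new layout: #{NEW_LAYOUT.bytesize} bytes, size field at offset #{NEW_SIZE_OFFSET}"
```

    If the C side uses the longer layout while Ruby allocates the shorter one, the allocation is one pointer too small and every field is read at the wrong offset, which would be consistent with the crash in tidybuf.rb.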