Tags: ruby-on-rails, ruby, nokogiri, rufus-scheduler

What's the best way to schedule and execute repetitive tasks (like scraping a page for information) in Rails?


I'm looking for a solution which enables:

  1. Repeated execution of a scraping task (nokogiri)
  2. Changing the time interval via a page such as http://www.myapp.com/interval (example)

What is the best solution/way to get this done?

Options I know about

  • Custom Rake task
  • Rufus Scheduler

Current situation

In ./config/initializers/task_scheduler.rb I have:

require 'nokogiri'
require 'open-uri'
require 'rufus-scheduler'

scheduler = Rufus::Scheduler.new

scheduler.every "1h" do
    puts "BEGIN SCHEDULER at #{Time.now}"

    url = "http://www.marktplaats.nl/z/computers-en-software/apple-ipad/ipad-mini.html?query=ipad+mini&categoryId=2722&priceFrom=100%2C00&priceTo=&startDateFrom=always"
    # URI.open is needed on Ruby 3.0+, where Kernel#open no longer opens URLs
    doc = Nokogiri::HTML(URI.open(url))

    # Listings are grouped under .group-0 and .group-1
    2.times do |number|
        doc.css(".defaultSnippet.group-#{number}").each do |listing|
            title = listing.at_css(".mp-listing-title").text
            subtitle = listing.at_css(".mp-listing-description").text
            price = listing.at_css(".price").text

            Listing.create(title: title, subtitle: subtitle, price: price)
        end
    end

    puts "END SCHEDULER at #{Time.now}"
end

Is it not working?

Yes, the current setup is working. However, I don't know how to make the interval changeable via http://www.myapp.com/interval (example).

Changing scheduler.every "1h" do to scheduler.every "#{@interval}" do does not work.

In what file do I have to define @interval for it to work in task_scheduler.rb?


Solution

  • First off: your rufus-scheduler code is in an initializer, which is fine, but an initializer runs exactly once, while the Rails process is booting, and before any request is handled. So in the initializer you have no access to an @interval variable that you set elsewhere, for instance in a controller.

    Possible options, instead of an instance variable:

    • read it from a config file
    • read it from a database (but you may have to set up your own connection, since as far as I know ActiveRecord is not yet started in the initializer)

    And if you change the value, you will have to restart your Rails process for the change to take effect.
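
A minimal sketch of the config-file option. The helper name load_interval, the file config/scraper.yml and its interval key are assumptions made for illustration; they are not part of rufus-scheduler or Rails.

```ruby
require 'yaml'

# Hypothetical helper: reads an interval string such as "30m" from a YAML file.
# Falls back to the default when the file is missing, empty, or has no
# "interval" key.
def load_interval(path, default = "1h")
  return default unless File.exist?(path)

  config = YAML.safe_load(File.read(path)) || {}
  return default unless config.is_a?(Hash)

  value = config["interval"]
  value.nil? ? default : value.to_s
end

# In the initializer this could be used as (assumed file location):
#   scheduler.every load_interval(Rails.root.join("config", "scraper.yml")) do
#     ...
#   end
```

The file is still only read at boot, so this does not remove the need for a restart after changing the value.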

    So an alternative approach, where your Rails process controls the interval of the scheduled job, is to use a recurring background job. At the end of each run, the job reschedules itself using the interval that is active at that moment. I would propose fetching the interval from the database. Any background job handler could do this. Check The Ruby Toolbox; my vote goes to resque or delayed_job.
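
The self-rescheduling idea can be sketched in plain Ruby. InMemoryQueue is a hypothetical stand-in for a real job backend, and ScrapeJob, enqueue_in and interval_source are names invented for this illustration. With delayed_job, the rescheduling step would typically be a call along the lines of Delayed::Job.enqueue(job, run_at: ...), but check the documentation of whichever backend you choose.

```ruby
# Hypothetical stand-in for a real job queue (delayed_job, resque, ...).
# It only records what was scheduled; it does not run anything.
class InMemoryQueue
  attr_reader :scheduled

  def initialize
    @scheduled = []
  end

  # Record that `job` should run `delay` seconds from now.
  def enqueue_in(delay, job)
    @scheduled << [delay, job]
  end
end

class ScrapeJob
  # interval_source: any callable returning the current interval in seconds,
  # e.g. a lambda that reads a settings row from the database (assumption).
  def initialize(queue, interval_source, &work)
    @queue = queue
    @interval_source = interval_source
    @work = work
  end

  def perform
    @work&.call
  ensure
    # Reschedule even if the scrape raised, using the interval that is
    # active right now rather than the one from when the job was created.
    @queue.enqueue_in(@interval_source.call, self)
  end
end
```

Because the interval is looked up at the end of every run, a controller action behind /interval only has to update the stored value; the next run picks it up without a restart.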