Rake task not saving or creating new record in database


I've created a Ruby script that runs fine when I load it from the Rails console.

The script fetches some information from various websites and saves it to my database table.

However, when I turn the code into a Rake task, it still runs but does not save any new records. Rake doesn't report any errors either.

# Add your own tasks in files placed in lib/tasks ending in .rake,
# for example lib/tasks/capistrano.rake, and they will automatically be available to Rake.

require File.expand_path('../config/application', __FILE__)

Rails.application.load_tasks

require './crawler2.rb'
task :default => [:crawler]

task :crawler do

### ###

require 'rubygems'
require 'nokogiri'
require 'open-uri'

start = Time.now

$a = 0

sites = ["http://www.nytimes.com","http://www.news.com"]

for $a in 0..sites.size-1

url = sites[$a] 

$i = 75

$error = 0

avoid_these_links = ["/tv", "//www.facebook.com/"]

doc = Nokogiri::HTML(open(url))

    links = doc.css("a")
    hrefs = links.map {|link| link.attribute('href').to_s}.uniq.sort.delete_if {|href| href.empty?}.delete_if {|href| avoid_these_links.any? { |w| href =~ /#{w}/ }}.delete_if {|href| href.size < 10 }

#puts hrefs.length

#puts hrefs

for $i in 0...hrefs.length
    begin

        #puts hrefs[60] #for debugging)

    #file = open(url)
    #doc = Nokogiri::HTML(file) do

        if hrefs[$i].downcase().include? "http://"

            doc = Nokogiri::HTML(open(hrefs[$i]))

        else 

            doc = Nokogiri::HTML(open(url+hrefs[$i]))

        end 

        image = doc.at('meta[property="og:image"]')['content']
        title = doc.at('meta[property="og:title"]')['content']
        article_url = doc.at('meta[property="og:url"]')['content']
        description = doc.at('meta[property="og:description"]')['content']
        category = doc.at('meta[name="keywords"]')['content']

        newspaper_id = 1 


        puts "\n"
        puts $i
        #puts "Image: " + image
        #puts "Title: " + title
        #puts "Url: " + article_url
        #puts "Description: " + description
        puts "Category: " + category

            Article.create({ 
            :headline => title, 
            :caption => description, 
            :thumbnail_url => image, 
            :category_id => 3, 
            :status => true, 
            :journalist_id => 2, 
            :newspaper_id => newspaper_id, 
            :from_crawler => true,
            :description => description,
            :original_url => article_url}) unless Article.exists?(original_url: article_url)

        $i +=1

        #puts $i #for debugging

        rescue
        #puts "Error here: " + url+hrefs[$i] if $i < hrefs.length
        $i +=1    # do_something_* again, with the next i
        $error +=1

    end 

end

puts "Page: " + url
puts "Articles: " + hrefs.length.to_s
puts "Errors: " + $error.to_s

$a +=1

end

finish = Time.now

diff = ((finish - start)/60).to_s

puts diff + " Minutes"


### ###


end

The code executes fine if I save it as crawler2.rb and load it in the console with load './crawler2.rb'. When I use the exact same code in a Rake task, I get no new records.
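
One likely reason no error shows up: the bare rescue in the loop catches every StandardError, including the NameError that Ruby raises when a constant such as a model class has not been loaded. Below is a minimal sketch of that effect. UndefinedModel is a hypothetical name that is deliberately never defined; it stands in for a model that is unavailable when the task runs.

```ruby
# UndefinedModel is a hypothetical constant that is deliberately never defined,
# standing in for a model class that has not been loaded.
silent_errors = 0
messages = []

# Bare rescue: the failure is counted but its cause is never shown.
begin
  UndefinedModel.create(headline: "Example")
rescue
  silent_errors += 1
end

# Capturing the exception object makes the cause visible.
begin
  UndefinedModel.create(headline: "Example")
rescue StandardError => e
  messages << "#{e.class}: #{e.message}"
end

puts "Silently rescued: #{silent_errors}"
puts messages
```

Printing e.class and e.message inside the crawler's rescue, instead of only incrementing the error counter, would probably have surfaced the problem on the first iteration.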


Solution

  • I figured out what was wrong.

    I needed to remove these lines from the Rakefile:

    require './crawler2.rb'
    task :default => [:crawler]
    

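    and change the task definition so that it depends on :environment. That prerequisite loads the Rails application, including the models and the database connection, before the task body runs: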

    task :crawler => :environment do
    

    Now the crawler runs every ten minutes with a bit of help from Heroku Scheduler :-)

    Thanks for the help guys - and sorry for the bad formatting. Hope this answer may help others.
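
For illustration, here is a minimal, self-contained sketch of how the :environment prerequisite affects execution order. It runs Rake outside of Rails, so the :environment task below is only a stand-in (an assumption for this example). In a real Rails app that task is provided by Rails and boots the application, and the crawler task would normally be written as task :crawler => :environment do in a file under lib/tasks.

```ruby
require 'rake'

run_order = []

# Stand-in for Rails' :environment task (an assumption for this sketch).
# In a Rails app, Rails defines it and it makes models such as Article available.
Rake::Task.define_task(:environment) do
  run_order << :environment
end

# Same dependency syntax as in the fix: crawler depends on environment.
Rake::Task.define_task(:crawler => :environment) do
  # The crawling code and Article.create calls would go here.
  run_order << :crawler
end

# Invoking the task runs its prerequisites first.
Rake::Task[:crawler].invoke
puts run_order.inspect
```

Because :environment is a prerequisite, Rake runs it before the body of :crawler, which is what makes the Article model usable inside the task.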