ruby-on-rails, ruby, forms, screen-scraping, mechanize-ruby

Log in automatically to get a scraped file in a Rails app with Mechanize


To log in and then download a PDF file, I have code that works perfectly fine in plain Ruby when I debug it. The problem is that when I use the same code in a Rails app with an instance variable, I can't download the file. I guess it's a cookie issue, but I haven't managed to resolve it.

Here is the code that works in plain Ruby (I can download the PDF file, so the login succeeds):

require 'rubygems'
require 'mechanize'

agent = Mechanize.new

agent.pluggable_parser.pdf = Mechanize::FileSaver

page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")

# login to the site
form = page.form_with(:id => 'form-login-page')
form.login = "my_login"
form.password = "my_password"
page = form.submit

# get the PDF link
agent.get("http://elwatan.com/").parser.xpath('//div[2]/div/p/a/@href|/img').each do |link|
  agent.get link['href']
end

And below is my attempt on Ruby on Rails 3, which didn't work (I can scrape the link, but I can't download the file because I get redirected to the login page):

Controller.rb

@agent = Mechanize.new
@agent.user_agent_alias = 'Mac Safari'
@page = @agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")

# login
form = @page.form_with(:id => 'form-login-page')
form.login = "my_login"
form.password = "my_password"
@page = form.submit

# get the PDF link
@watan = {}
@agent.get("http://elwatan.com/").parser.xpath('//div[2]/div/p/a/@href|/img').each do |link|
  @watan[link.text.strip] = @agent.get link['href']
end

View.rb

<% if @watan %>
<% @watan.each do |key, value| %>
<a href="http://www.elwatan.com<%= "#{key}" %>" target='_blank'>download my file</a>
<% end %>
<% end %>

Solution

  • This will be a long post.

    First off, you should place your scraping code in a library, so create the file lib/watan_scraper.rb and fill it with

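    require 'mechanize'

    module WatanScraper

      # Returns the text of every PDF link found on the home page.
      def self.get_all_pdfs
        agent = get_agent

        watan = []
        agent.get("http://elwatan.com/").parser.xpath('//div[2]/div/p/a/@href|/img').each do |link|
          watan << link.text.strip
        end
        watan
      end


      # Logs in, looks for the link whose text equals link_text and downloads it.
      # Returns the fetched file, or nil when no link matched.
      def self.get_single_pdf(link_text)
        agent = get_agent

        found_link = nil
        agent.get("http://elwatan.com/").parser.xpath('//div[2]/div/p/a/@href|/img').each do |link|
          # compare with ==, a single = would assign and always be truthy
          if link.text.strip == link_text
            found_link = link['href']
          end
        end

        # fetch the pdf with the logged-in agent
        agent.get(found_link) if found_link
      end


      # Builds a Mechanize agent that is already logged in.
      def self.get_agent
        agent = Mechanize.new
        agent.user_agent_alias = 'Mac Safari'
        page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")

        # login
        form = page.form_with(:id => 'form-login-page')
        form.login = "my_login"
        form.password = "my_password"
        form.submit

        agent
      end
      # a bare `private` does not apply to `def self.` methods, so hide it explicitly
      private_class_method :get_agent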
    
    end
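
One thing to check before this works: as far as I know, a Rails 3 app does not autoload files from lib/ by default, so the module above would not be found from a controller. A minimal sketch of the usual fix, assuming a standard Rails 3 config/application.rb (mechanize also has to be listed in your Gemfile):

```ruby
# config/application.rb, inside the `class Application < Rails::Application` body.
# Assumption: Rails 3.x, where lib/ is not on the autoload path by default.
config.autoload_paths += %W(#{config.root}/lib)
```

Restart the server after changing this file. An explicit `require 'watan_scraper'` at the top of the controller should work as well.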
    

    Ok, and now you can write in your controller

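    class PdfsController < ApplicationController
      def index
        @watan = WatanScraper.get_all_pdfs
      end

      def show
        pdf_name = params[:id]
        @pdf = WatanScraper.get_single_pdf(pdf_name)
        # no matching link on the page: answer with a 404
        return head(:not_found) unless @pdf

        # send the raw bytes of the downloaded file, not the Mechanize object
        send_data @pdf.body, :filename => "#{pdf_name}.pdf", :type => "application/pdf"
      end
    end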
    

    Your view should be in the file app/views/pdfs/index.html.haml (let's use haml)

    - @watan.each do |link_text| 
      = link_to "Download #{link_text}", pdf_path(link_text)
    

    Your routes should be as follows (config/routes.rb)

    resources :pdfs, only: [:index, :show]
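
One caveat with this route: the link text scraped here is an href, so it will probably contain slashes and dots, and Rails treats both specially inside an `:id` segment (a dot starts the format, a slash starts a new segment). A possible workaround, sketched below, is to hex-encode the href before passing it to `pdf_path` and decode it again in `show`. `PdfToken` and the example href are made up for illustration; they are not part of the site or of any library.

```ruby
# Hypothetical helper (not from the original answer): turns a scraped href
# into a token made only of 0-9 and a-f, and back again.
module PdfToken
  # Hex-encode the raw bytes of the href.
  def self.encode(href)
    href.unpack1('H*')
  end

  # Reverse of encode. Returns nil when the token is not well-formed hex.
  def self.decode(token)
    return nil unless token.is_a?(String) && token.match?(/\A(?:[0-9a-f]{2})+\z/)

    [token].pack('H*').force_encoding(Encoding::UTF_8)
  end
end

href  = '/pdf/example.php?file=today.pdf' # made-up href, for illustration only
token = PdfToken.encode(href)
puts token
puts PdfToken.decode(token)
```

In the view you would then link to `pdf_path(PdfToken.encode(link_text))`, and in `show` look up `PdfToken.decode(params[:id])`, returning a 404 when that is nil.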
    

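    This code is of course untested, but it is at least nicely structured: it fetches the pdf inside the logged-in Mechanize session and then sends it back to the browser. That is what your original view was missing. Its link sent the visitor's browser straight to elwatan.com, and that browser does not have the cookies held by the Mechanize agent, so the site redirected to the login page.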
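
Note that the scraper logs in again on every call, so `index` and `show` each cost a full login. If that turns out to be too slow, one option is to keep the logged-in agent around for a while. Below is a minimal sketch of such a cache in plain Ruby. `AgentCache` is a hypothetical name, and the block stands in for whatever builds the logged-in agent (in the real app that would be the login code from the scraper).

```ruby
# Hypothetical helper: remembers the result of an expensive "build" step
# (here: logging in) and rebuilds it only after ttl_seconds have passed.
class AgentCache
  def initialize(ttl_seconds, clock: -> { Time.now }, &builder)
    raise ArgumentError, 'a builder block is required' unless builder

    @ttl      = ttl_seconds
    @clock    = clock     # injectable so the expiry can be tested without sleeping
    @builder  = builder
    @mutex    = Mutex.new # only guards the rebuild, not the use of the agent
    @agent    = nil
    @built_at = nil
  end

  # Returns the cached object, building it first when missing or expired.
  def fetch
    @mutex.synchronize do
      now = @clock.call
      if @agent.nil? || now - @built_at >= @ttl
        @agent    = @builder.call
        @built_at = now
      end
      @agent
    end
  end
end

logins = 0
cache = AgentCache.new(600) do
  logins += 1
  Object.new # stand-in for the logged-in Mechanize agent
end

first  = cache.fetch
second = cache.fetch
puts logins
puts first.equal?(second)
```

Two things this sketch does not handle: the site may expire the session sooner than the chosen TTL, and I would not assume one Mechanize agent can safely be shared between threads, so on a multi-threaded server a per-thread cache is the safer choice.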