ruby-on-rails ruby forms screen-scraping mechanize-ruby

Log in automatically to download a scraped file in a Rails app with Mechanize

To log in and then download a PDF file, I have code that works perfectly in plain Ruby when I debug it. The problem is that when I use this code in a Rails app with an instance variable, I can't download the file. I guess it's a cookie issue, but I haven't managed to resolve it.

Here is the code that works in plain Ruby (I can download the PDF file, so the login succeeds):

require 'rubygems'
require 'mechanize'

agent = Mechanize.new

agent.pluggable_parser.pdf = Mechanize::FileSaver

page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")

# login to the site
form = page.form_with(:id => 'form-login-page')
form.login = "my_login"
form.password = "my_password"
page = form.submit

# get the PDF link
agent.get("http://elwatan.com/").parser.xpath('//div[2]/div/p/a/@href|/img').each do |link|
  agent.get link['href']
end

And below is my attempt on Ruby on Rails 3, which didn't work (I can scrape the link, but I can't download the file because I get redirected to the login page):

Controller.rb

@agent = Mechanize.new
@agent.user_agent_alias = 'Mac Safari'
@page = @agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")

# login
form = @page.form_with(:id => 'form-login-page')
form.login = "my_login"
form.password = "my_password"
@page = form.submit

# get the PDF link
@watan = {}
@agent.get("http://elwatan.com/").parser.xpath('//div[2]/div/p/a/@href|/img').each do |link|
  @watan[link.text.strip] = @agent.get link['href']
end

View.rb

<% if @watan %>
  <% @watan.each do |key, value| %>
    <a href="http://www.elwatan.com<%= key %>" target="_blank">download my file</a>
  <% end %>
<% end %>

Solution

This will be a long post.

First off, you should place your scraping code in a library, so create the file lib/watan_scraper.rb and fill it with:

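require 'mechanize'

module WatanScraper

  # Returns the text of every PDF link found on the home page.
  def self.get_all_pdfs
    agent = get_agent

    watan = []
    agent.get("http://elwatan.com/").parser.xpath('//div[2]/div/p/a/@href|/img').each do |link|
      watan << link.text.strip
    end
    watan
  end


  # Fetches the PDF whose link text matches link_text, using the logged-in
  # agent. Returns nil when no link matches.
  def self.get_single_pdf(link_text)
    agent = get_agent

    found_link = nil
    agent.get("http://elwatan.com/").parser.xpath('//div[2]/div/p/a/@href|/img').each do |link|
      # compare with ==; a single = would assign instead of compare
      if link.text.strip == link_text
        found_link = link['href']
      end
    end

    # fetch pdf (evaluates to nil when nothing matched)
    agent.get(found_link) if found_link
  end


  # Builds a Mechanize agent that is already logged in.
  def self.get_agent
    agent = Mechanize.new
    agent.user_agent_alias = 'Mac Safari'
    page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")

    # login
    form = page.form_with(:id => 'form-login-page')
    form.login = "my_login"
    form.password = "my_password"
    form.submit

    agent
  end
  # a bare `private` does not apply to `def self.` methods, so hide it explicitly
  private_class_method :get_agent

end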
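
The login and password are hard-coded in get_agent above. A minimal sketch of reading them from the environment instead, assuming two variable names (WATAN_LOGIN and WATAN_PASSWORD) that I made up for illustration; the helper name watan_credentials is hypothetical too:

```ruby
# Hypothetical helper: WATAN_LOGIN and WATAN_PASSWORD are assumed names,
# nothing that the site or Mechanize defines.
# `env` defaults to ENV but accepts any Hash-like object, which keeps it testable.
def watan_credentials(env = ENV)
  {
    # fetch raises KeyError when the variable is missing,
    # so a misconfigured server fails loudly instead of logging in as nil
    :login    => env.fetch("WATAN_LOGIN"),
    :password => env.fetch("WATAN_PASSWORD")
  }
end
```

Inside get_agent you could then write form.login = watan_credentials[:login] and form.password = watan_credentials[:password], which keeps the real credentials out of version control.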

OK, and now you can write in your controller:

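# needed if lib/ is not in your autoload paths
require 'watan_scraper'

class PdfsController < ApplicationController
  def index
    @watan = WatanScraper.get_all_pdfs
  end

  def show
    pdf_name = params[:id]
    # the scraper method is get_single_pdf; it returns nil when no link matches
    @pdf = WatanScraper.get_single_pdf(pdf_name)
    return head(:not_found) unless @pdf

    # send the raw bytes of the fetched file, not the Mechanize object itself
    send_data @pdf.body, :filename => "#{pdf_name}.pdf", :type => "application/pdf"
  end
end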
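
Because the XPath selects @href attributes, the id that reaches the show action is likely to be a path or URL rather than a tidy name, which makes a poor :filename. A minimal sketch of deriving a cleaner filename, assuming the id is a parseable URL or path; pdf_filename is a hypothetical helper, not part of Rails or Mechanize:

```ruby
require 'uri'

# Hypothetical helper: derives a download filename from a scraped href.
# Assumes the href parses as a URI; URI.parse raises on malformed input.
def pdf_filename(href, fallback = "download")
  # drop the query string and directories, then strip any extension
  base = File.basename(URI.parse(href).path.to_s, ".*")
  base = fallback if base.empty? || base == "/"
  # keep only characters that are safe in a filename
  "#{base.gsub(/[^\w.-]/, '_')}.pdf"
end
```

In the show action this would replace the interpolated string, for example :filename => pdf_filename(pdf_name).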

Your view should be in the file app/views/pdfs/index.html.haml (let's use Haml):

- @watan.each do |link_text| 
  = link_to "Download #{link_text}", pdf_path(link_text)

Your routes should be as follows (in config/routes.rb):

resources :pdfs, only: [:index, :show]
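
One caveat with this route: by default a dynamic segment such as :id does not match slashes, and a dot is treated as the start of the format, so passing a raw href to pdf_path may not round-trip. A minimal sketch of one workaround, encoding the href into a URL-safe token before building the link and decoding it in the controller. The helper names encode_pdf_id and decode_pdf_id are hypothetical:

```ruby
require 'base64'

# Hypothetical helpers: turn a scraped href into a token that is safe to use
# as the :id segment of a route, and back again.
def encode_pdf_id(href)
  # urlsafe_encode64 uses '-' and '_' instead of '+' and '/'
  Base64.urlsafe_encode64(href)
end

def decode_pdf_id(token)
  # decoding yields a binary string, so restore the UTF-8 encoding
  Base64.urlsafe_decode64(token).force_encoding("UTF-8")
end
```

The view would then call pdf_path(encode_pdf_id(link_text)), and the show action would start with pdf_name = decode_pdf_id(params[:id]).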

This code is of course untested, but it is at least nicely structured: it fetches the PDF in the right session (using Mechanize) and then sends it back to the browser.
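
Since the original symptom was a redirect to the login page, it may be worth checking that what came back really is a PDF before sending it on. A minimal sketch, relying only on the fact that PDF files begin with the bytes %PDF-; pdf_body? is a hypothetical helper name:

```ruby
# Hypothetical helper: true when the given body looks like a PDF document.
# PDF files start with the magic bytes "%PDF-"; an HTML login page does not.
def pdf_body?(body)
  # to_s handles nil, and .b compares raw bytes regardless of encoding
  body.to_s.b.start_with?("%PDF-")
end
```

In the controller you could respond with an error when pdf_body?(@pdf.body) is false, instead of handing the browser an HTML login page saved under a .pdf name.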