Tags: ruby, forms, screen-scraping, mechanize

In Mechanize (Ruby), how to log in, then scrape?


My aim: on Rails 3, get a PDF file from a site that requires you to log in before you can download it.

My method, using Mechanize:

Step 1: log in.
Step 2: since I'm logged in, get the PDF link.

The thing is, when I debug and follow the scraped link, I'm redirected to the login page instead of getting the file.

Here are the two checks I did on step 1:

(...)
search_results = form.submit
puts search_results.body

=> {"succes":true,"URL":"/sso/inscription/"} Apparently the login succeed

puts agent.cookie_jar.jar

=> I could find the information about my session, so I guess the cookies are saved.
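
For reference, a cleaner way to inspect the stored cookies (assuming a reasonably recent Mechanize, which exposes agent.cookies) is to iterate over them rather than printing the internal jar:

agent.cookies.each do |cookie|
  # print domain and name=value for every cookie the agent is holding
  puts "#{cookie.domain} #{cookie.name}=#{cookie.value}"
end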

Any hint about what I did wrong? (Possibly important: on the site, when you log in at "http://elwatan.com/sso/inscription/inscription_payant.php", you are redirected to the home page, elwatan.com.)

Below is my code:

# step 1, login:
agent = Mechanize.new
page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")

form = page.form_with(:id => 'form-login-page')
form.login = "my_mail"
form.password = "my_pasword"
search_results = form.submit

# step 2, get the PDF:
@watan = {}
page.parser.xpath('//th/a').each do |link|
  puts @watan[link.text.strip] = link['href']
end

Solution

  • The agent variable retains the session and cookies.

    So you first do your login, as you did, and then you write agent.get(---your-pdf-link-here--).

There is a small error in your example code: the result of the submit is in search_results, but you then continue to use page to search for the links.

    So in your case, I guess it should look like this (untested, of course):

    # step 1, login:
    agent = Mechanize.new
    agent.pluggable_parser.pdf = Mechanize::FileSaver  # save PDF responses to disk instead of parsing them
    
    page = agent.get("http://elwatan.com/sso/inscription/inscription_payant.php")
    
    form = page.form_with(:id => 'form-login-page')
    form.login = "my_mail"
    form.password = "my_pasword"
    page = form.submit
    
    # step 2, get the PDF:
    page.parser.xpath('//th/a').each do |link|
      agent.get link['href']  # the logged-in agent fetches each PDF; FileSaver writes it to disk
    end
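
    As a side note, Mechanize::FileSaver writes each file under a directory tree derived from its URL. If you would rather choose the filename yourself, one possible variation (untested; the output filenames below are just examples) is to use Mechanize::Download instead:

    # register Mechanize::Download for PDFs so each response can be saved explicitly
    agent.pluggable_parser.pdf = Mechanize::Download

    page.parser.xpath('//th/a').each_with_index do |link, i|
      file = agent.get(link['href'])   # Mechanize resolves relative hrefs against the last fetched page
      file.save("elwatan_#{i}.pdf")    # Mechanize::Download#save writes the body to the given path
    end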