Replace authenticated user agent login/page scrape using Perl and Mojolicious


I am trying to port some old web scraping scripts written using older Perl modules to work using only Mojolicious.

I have written a few basic scripts with Mojo, but I am puzzled by how a form-based login on a secure (HTTPS) site should be handled in a Mojo::UserAgent script. Unfortunately, the only example I can see in the documentation is for basic authentication without forms.

The Perl script I am trying to convert to work with Mojo::UserAgent is as follows:

#!/usr/bin/perl

use LWP;
use LWP::Simple;
use LWP::Debug qw(+);
use LWP::Protocol::https;
use WWW::Mechanize;
use HTTP::Cookies;

# login first before navigating to pages
# Create our automated browser and set up to handle cookies
my $agent = WWW::Mechanize->new();
$agent->cookie_jar(HTTP::Cookies->new());
$agent->agent_alias( 'Windows IE 6' );  #tell the website who we are (old!)

# get login page

$agent->get("https://reg.mysite.com");
$agent->success or die $agent->response->status_line;

# complete the user name and password form
$agent->form_number (1);
$agent->field (username => "user1");
$agent->field (password => "pass1");
$agent->click();

# try to get a members-only content page from the main site, on the basis that we are now "logged in"
$agent->get("http://www.mysite.com/memberpagesonly1");
$agent->success or die $agent->response->status_line;

my $member_page = $agent->content();
print "$member_page\n";

The above works fine. How can I convert it to do the same job in Mojolicious?


Solution

  • Mojolicious is a web application framework. While Mojo::UserAgent works well as a low-level HTTP user agent and provides facilities that are unavailable from LWP (in particular native support for asynchronous requests and IPv6), neither is as convenient to use as WWW::Mechanize for web scraping.

    WWW::Mechanize subclasses LWP::UserAgent to interface with the internet, and uses HTML::Form to process the forms it finds. Mojo::UserAgent has no facility for processing HTML forms, so building the corresponding HTTP requests is not at all straightforward. Determining the HTTP method used (GET or POST), collecting the names of the form fields, and inserting default values for hidden fields are all done automatically by HTML::Form, and are left to the programmer if you restrict yourself to Mojo::UserAgent.

    It seems to me that even trying to use Mojo::UserAgent in combination with HTML::Form is problematic, as the former requires a Mojo::Transaction::HTTP object to represent the submission of a filled-in form, whereas the latter generates HTTP::Request objects for use with LWP.

    In short, unless you are willing to largely rewrite WWW::Mechanize, I think there is no way to reimplement your software using Mojolicious modules.
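
To make concrete what "left to the programmer" means, here is a minimal sketch of the manual steps for one simple login form: read the form's method, action and default input values with Mojo::DOM, then build and send the request with Mojo::UserAgent. It is not a replacement for HTML::Form. It ignores select and textarea elements, drops submit buttons, and simply takes the first form on the page. The helper names parse_form and login_and_fetch are hypothetical, the field names username and password are taken from the question, and the code assumes Mojolicious 7.0 or later (for the result method).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Mojo::DOM;
use Mojo::URL;
use Mojo::UserAgent;

# Hypothetical helper: collect by hand what HTML::Form would work out
# automatically, i.e. the form's method, its action URL and the default
# values of its named <input> elements (hidden fields included).
sub parse_form {
    my ($html, $selector) = @_;
    my $form = Mojo::DOM->new($html)->at($selector)
        or die "No form matching '$selector'\n";

    my %fields;
    for my $input ($form->find('input[name]')->each) {
        my $type = lc($input->attr('type') // 'text');
        # Simplification: buttons and file uploads are not submitted
        next if $type =~ /^(?:submit|button|image|reset|file)$/;
        # Unchecked checkboxes and radio buttons are not submitted either
        next if $type =~ /^(?:checkbox|radio)$/
            && !exists $input->attr->{checked};
        $fields{ $input->attr('name') } = $input->attr('value') // '';
    }

    return {
        method => uc($form->attr('method') // 'GET'),
        action => $form->attr('action') // '',
        fields => \%fields,
    };
}

# Hypothetical helper: log in through the first form on the login page,
# then fetch a page that needs the session cookie.
sub login_and_fetch {
    my (%arg) = @_;

    # Mojo::UserAgent keeps cookies in its built-in cookie jar
    my $ua = Mojo::UserAgent->new(max_redirects => 5);

    my $tx  = $ua->get($arg{login_url});
    my $res = $tx->result;    # dies on connection errors
    die 'Login page: ', $res->code, ' ', $res->message, "\n"
        unless $res->is_success;

    # Merge our credentials over the form's default (e.g. hidden) values
    my $form   = parse_form($res->text, 'form');
    my %fields = (%{ $form->{fields} }, %{ $arg{credentials} });

    # Resolve a relative action against the URL the form was served from
    my $action = Mojo::URL->new($form->{action})->to_abs($tx->req->url);

    # Build the transaction by hand; the "form" generator encodes the
    # fields into the body (POST) or the query string (GET)
    my $login = $ua->start(
        $ua->build_tx($form->{method} => $action => form => \%fields)
    )->result;
    die 'Login: ', $login->code, ' ', $login->message, "\n"
        unless $login->is_success;

    my $page = $ua->get($arg{member_url})->result;
    die 'Member page: ', $page->code, ' ', $page->message, "\n"
        unless $page->is_success;
    return $page->text;
}

# Example call, using the placeholder URLs and credentials from the question:
# print login_and_fetch(
#     login_url   => 'https://reg.mysite.com',
#     member_url  => 'http://www.mysite.com/memberpagesonly1',
#     credentials => { username => 'user1', password => 'pass1' },
# );
```

Every form-handling rule in this sketch (which inputs count, how the action is resolved, which method is used) had to be written out explicitly, and any form that departs from these assumptions needs more code. That is the work WWW::Mechanize and HTML::Form do for you.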