Tags: perl, https, screen-scraping, sessionid, www-mechanize

Why am I getting a new session ID on every page fetch in my Perl WWW::Mechanize script?


I'm scraping a site that I have access to over HTTPS. I can log in and start the process, but each time I request a new page (URL) the session ID in the cookie changes. How do I keep the session ID from the logged-in cookie?

#!/usr/bin/perl -w
use strict;
use warnings;
use WWW::Mechanize;
use HTTP::Cookies;
use LWP::Debug qw(+);
use HTTP::Request;
use LWP::UserAgent;
use HTTP::Request::Common;

my $un = 'username';
my $pw = 'password';

my $url = 'https://subdomain.url.com/index.do';

my $agent = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0);
$agent->{onerror}=\&WWW::Mechanize::_warn;
$agent->agent('Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.3) Gecko/20100407 Ubuntu/9.10 (karmic) Firefox/3.6.3');
$agent->get($url);

$agent->form_name('form');
$agent->field(username => $un);
$agent->field(password => $pw);
$agent->click("Log In");

print "After Login Cookie: ";
print $agent->cookie_jar->as_string();
print "\n\n";

my $searchURL='https://subdomain.url.com/search.do';
$agent->get($searchURL);    

print "After Search Cookie: ";
print $agent->cookie_jar->as_string();
print "\n";

The output:

After Login Cookie: Set-Cookie3: JSESSIONID=367C6D; path="/thepath"; domain=subdomain.url.com; path_spec; secure; discard; version=0

After Search Cookie: Set-Cookie3: JSESSIONID=855402; path="/thepath"; domain=subdomain.url.com; path_spec; secure; discard; version=0

Also, I think the site requires a certificate (in the browser it does, at least). Would this be the correct way to add it?

$ENV{HTTPS_CERT_FILE} = 'SUBDOMAIN.URL.COM'; ## Insert this after the use HTTP::Request...

Also, for the certificate, I'm using the first option in this list. Is that correct?

X.509 Certificate (PEM)
X.509 Certificate with chain (PEM)
X.509 Certificate (DER)
X.509 Certificate (PKCS#7)
X.509 Certificate with chain (PKCS#7)

Solution

  • When your user-agent isn't doing something you think it should be doing, compare its requests with those of an interactive browser. Firefox plugins are handy for this sort of thing.

    You're probably missing part of the process that the server expects. You probably aren't logging in or interacting correctly, and that could be for all sorts of reasons. For instance, there might be JavaScript on the page that WWW::Mechanize isn't handling.

    When you can pinpoint what an interactive browser is doing that you are not, you'll know where you need to improve your script.

    In your script, you can also watch what is happening by turning on debugging in LWP, which Mech is built on:

     use LWP::Debug qw(+); 
    

    rjh already answered the certificate part of your question.
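
To make the comparison with a browser concrete, here is a minimal sketch that records the Cookie header of every outgoing request and the Set-Cookie headers of every response. It uses LWP::UserAgent's add_handler with the request_send and response_done phases, which Mech inherits. This is an alternative to LWP::Debug, which recent libwww-perl releases document as deprecated. The names @trace, trace_request and trace_response are made up for this illustration and are not part of any module.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical log: one line per request/response, so the cookie traffic
# can be compared with what a browser sends for the same steps.
my @trace;

# Runs just before a request is sent, after the cookie jar has had the
# chance to attach its Cookie header.
sub trace_request {
    my ($request, $ua, $handler) = @_;
    my $cookie = $request->header('Cookie') // '(none)';
    push @trace, sprintf '> %s %s Cookie: %s',
        $request->method, $request->uri, $cookie;
    return;    # returning nothing lets the request proceed normally
}

# Runs once a response has been received.
sub trace_response {
    my ($response, $ua, $handler) = @_;
    my @set = $response->header('Set-Cookie');
    push @trace, sprintf '< %s Set-Cookie: %s',
        $response->status_line, @set ? join(' | ', @set) : '(none)';
    return;
}

my $mech = WWW::Mechanize->new(cookie_jar => {}, autocheck => 0);
$mech->add_handler(request_send  => \&trace_request);
$mech->add_handler(response_done => \&trace_response);

# After running the login and search steps from the question:
#   print "$_\n" for @trace;
```

If the trace shows the search request going out without the JSESSIONID you received at login, or the login response not looking like what the browser gets, that is the step to compare more closely with the browser's traffic.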