perlwebformsscreen-scrapinghtml-content-extraction

What's the best way to write a maintainable web scraping app?


I wrote a Perl script a while ago that logged into my online banking and emailed me my balance and a mini-statement every day. I found it very useful for keeping track of my finances. The only problem is that I wrote it using just Perl and curl, and it was quite complicated and hard to maintain. After my bank changed their webpage a few times, I got fed up with debugging it to keep it up to date.

So what's the best way to write such a program so that it's easy to maintain? I'd like to write a well-engineered version in either Perl or Java that will be easy to update when the bank inevitably fiddles with its website.


Solution

  • In Perl, something like WWW::Mechanize can already make your script simpler and more robust, because it can find HTML forms in previous responses from the website. You can fill in these forms to prepare a new request. For example:

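    use strict;
    use warnings;
    use WWW::Mechanize;

    # $url and $password are assumed to be defined elsewhere.
    my $mech = WWW::Mechanize->new();
    $mech->get($url);
    # Fill in and submit the first form on the page.
    $mech->submit_form(
        form_number => 1,
        fields      => { password => $password },
    );
    die "Form submission failed (HTTP " . $mech->status . ")\n" unless $mech->success;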
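
One way to make a script like this easier to update is to keep the site-specific details in a single place and to select the form by the fields it contains rather than by its position on the page. The following is only a sketch under assumptions: the `%site` hash, its keys, the `bank.example` URL, the field names, and the `login` helper are all hypothetical and not taken from any real bank. It relies on the `with_fields` option of `submit_form`, which picks the first form containing all the named fields.

```perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical site details; every value here is a placeholder.
# When the bank changes its page, ideally only this hash needs editing.
my %site = (
    login_url      => 'https://bank.example/login',
    username_field => 'username',
    password_field => 'password',
);

# Log in using the details in $site. Returns true if the last
# HTTP response was successful.
sub login {
    my ($mech, $site, $user, $pass) = @_;
    $mech->get($site->{login_url});
    # with_fields selects the first form that contains all of these
    # fields, so the script does not depend on the form's position.
    $mech->submit_form(
        with_fields => {
            $site->{username_field} => $user,
            $site->{password_field} => $pass,
        },
    );
    return $mech->success;
}

# Example call (needs network access and real credentials):
#   my $mech = WWW::Mechanize->new();
#   login($mech, \%site, $user, $pass) or die "Login failed\n";
```

Passing the `$mech` object into `login` also makes it possible to substitute a stub object when testing the script without touching the real site.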