Search code examples
jsonperlmojoliciousmojo

perl Mojo and JSON for simultaneous requests


I'm usually no Perl coder. However I've got to complete this task.

The following code works for me:

#!/usr/bin/perl

use LWP::UserAgent;
use JSON;
use strict;

my $md5 = $ARGV[0];
$md5 =~ s/[^A-Fa-f0-9 ]*//g;
die "invalid MD5" unless ( length($md5) == 32 );

my $ua = LWP::UserAgent->new(ssl_opts => { verify_hostname => 1 }, timeout => 10);
my $key="12345...7890";
my $url='https://www.virustotal.com/vtapi/v2/file/report';
my $response = $ua->post( $url, ['apikey' => $key, 'resource' => $md5] );
die "$url error: ", $response->status_line unless $response->is_success;
my $results=$response->content;

my $json = JSON->new->allow_nonref;
my $decjson = $json->decode( $results);

print "md5: ",$md5,"\n";
print "positives: ", $decjson->{"positives"}, "\n";
print "total: ", $decjson->{"total"}, "\n";
print "date: ", $decjson->{"scan_date"}, "\n";

Now I would like to recode the above for using asynchronous http using Mojo. I'm trying this:

#!/usr/bin/perl

use warnings;
use strict;
use Mojo;
use Mojo::UserAgent;

my $md5 = $ARGV[0];
$md5 =~ s/[^A-Fa-f0-9 ]*//g;
die "invalid MD5" unless ( length($md5) == 32 );

my ($vt_positives, $vt_scandate, $response_vt);
my $url='https://www.virustotal.com/vtapi/v2/file/report';
my $key="12345...7890";
my $ua = Mojo::UserAgent->new;
my $delay = Mojo::IOLoop->delay;

$ua->max_redirects(0)->connect_timeout(3)->request_timeout(6);
$ua->max_redirects(5);
$delay->begin;

$response_vt = $ua->post( $url => ['apikey' => $key, 'resource' => $md5] => sub {
    my ($ua, $tx) = @_;
    $vt_positives=$tx->res->json->{"positives"};
    print "Got response: $vt_positives\n";
    });

Mojo::IOLoop->start unless Mojo::IOLoop->is_running;

The first code is OK, the second isn't working. I must be doing something wrong when sending the request since I seem to get a 403 response (incorrect API usage). I also tried -> json calls but it didn't work out.

And even if I had done the request correctly, I'm not sure if I'm correctly decoding the json results with Mojo.

Help will be appreciated!


Solution

  • EDIT

    It seems that we missed the real question, how to post forms. Oops sorry about that.

    Posting forms depends on which version of Mojolicious you are using. Until recently (v3.85 -- 2013-02-13) there was a post_form method. On reflection however, it was decided there should either be *_form methods for every request type, or we should do something smarter, and thus the form generator was born.

    $response_vt = $ua->post( 
      $url, 
      form => {'apikey' => $key, 'resource' => $md5}, 
      sub { ... }
    );
    

    It can be added to any request method, making it much more consistent than the old form. Also note that it should be a hashref, not an arrayref as LWP allows. BTW there is also a json generator that works like this too, or you can even add your own!

    I'm leaving my original answer, showing non-blocking usage, which you may now amend given the above.

    ORIGINAL

    Building off the logic from creaktive, this is how I would start. The major difference is that there isn't a monitor watching to be sure that there are works going, rather when one finishes it checks to be sure that there are no idlers.

    I have also made some changes in the parsing logic, but nothing major.

    #!/usr/bin/env perl
    use Mojo::Base -strict;
    use utf8::all;
    
    use Mojo::URL;
    use Mojo::UserAgent;
    
    # FIFO queue
    my @urls = qw(
        http://sysd.org/page/1/
        http://sysd.org/page/2/
        http://sysd.org/page/3/
    );
    
    # User agent following up to 5 redirects
    my $ua = Mojo::UserAgent
        ->new(max_redirects => 5)
        ->detect_proxy;
    
    start_urls($ua, \@urls, \&get_callback);
    
    sub start_urls {
      my ($ua, $queue, $cb) = @_;
    
      # Limit parallel connections to 4
      state $idle = 4;
      state $delay = Mojo::IOLoop->delay(sub{say @$queue ? "Loop ended before queue depleated" : "Finished"});
    
      while ( $idle and my $url = shift @$queue ) {
        $idle--;
        print "Starting $url, $idle idle\n\n";
    
        $delay->begin;
    
        $ua->get($url => sub{ 
          $idle++; 
          print "Got $url, $idle idle\n\n"; 
          $cb->(@_, $queue); 
    
          # refresh worker pool
          start_urls($ua, $queue, $cb); 
          $delay->end; 
        });
    
      }
    
      # Start event loop if necessary
      $delay->wait unless $delay->ioloop->is_running;
    }
    
    sub get_callback {
        my ($ua, $tx, $queue) = @_;
    
        # Parse only OK HTML responses
        return unless 
            $tx->res->is_status_class(200)
            and $tx->res->headers->content_type =~ m{^text/html\b}ix;
    
        # Request URL
        my $url = $tx->req->url;
        say "Processing $url";
        parse_html($url, $tx, $queue);
    }
    
    sub parse_html {
        my ($url, $tx, $queue) = @_;
    
        state %visited;
    
        my $dom = $tx->res->dom;
        say $dom->at('html title')->text;
    
        # Extract and enqueue URLs
        $dom->find('a[href]')->each(sub{
    
            # Validate href attribute
            my $link = Mojo::URL->new($_->{href});
            return unless eval { $link->isa('Mojo::URL') };
    
            # "normalize" link
            $link = $link->to_abs($url)->fragment(undef);
            return unless grep { $link->protocol eq $_ } qw(http https);
    
            # Don't go deeper than /a/b/c
            return if @{$link->path->parts} > 3;
    
            # Access every link only once
            return if $visited{$link->to_string}++;
    
            # Don't visit other hosts
            return if $link->host ne $url->host;
    
            push @$queue, $link;
            say " -> $link";
        });
        say '';
    
        return;
    }