
Why does LWP::Simple::get slow down a subsequent regular expression?


I am benchmarking a Perl regex substitution on a string.

If I fetch the string with a shell call, it works great. But if I fetch it with LWP::Simple, the subsequent regex is slowed down: with LWP the loop takes 13s, with wget it takes less than 4s.

Why?

#!/usr/bin/perl
use strict;
use LWP::Simple ();   # needed for the LWP::Simple::get call below
use Time::HiRes qw( gettimeofday tv_interval );
my %data;

$data{'TO'} = "rcpt";
$data{'MESSAGE_ID'} = "37";
$data{'ID'} = "7";
$data{'UNIQID'} = "cff47534-fe6b-c45a-7058-8301adf1b97";
$data{'XOR'} = "abcdef";

my $url = "http://raw.githubusercontent.com/ramprasadp/hostedtexfiles/master/msg2.txt";

#
# Fetching with LWP::Simple makes the rest of the program very slow
#
my $msg_string = LWP::Simple::get($url);


# While fetching with wget works great
#my $msg_string = `wget -q -O - $url`;

my $start = [gettimeofday];
for (my $j=0;$j<50000; $j++) {
    my $tmp_string = $msg_string;
    $tmp_string =~ s/\$\{ (\w+) \}/$data{$1}/g;
}
print "Time taken in ms is " . 1000 * tv_interval ( $start )."\n";

Solution

  • It's fast because it's wrong. The substitution operator s/// is meant to work on text strings, that is, sequences of characters. LWP::Simple::get returns a decoded text string, so that case is fine. `wget …` in backticks returns a buffer of octets. Your substitution happens to give the correct result on the octets, but only by coincidence, and it will not work in the general case. It can break once characters with high codepoints come into play and the pattern matches or replaces part of a multi-byte character at the octet level.

    You can verify this with Encode: either encode the get text string to octets, and both variants will be fast and wrong, or decode the `wget …` octets into a text string, and both variants will be correct and slow.

    Read https://p3rl.org/UNI for an introduction to the topic.
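
    The check described above can be sketched without any network access. This is only an illustration under stated assumptions: the sample line, the helper names `expand` and `time_ms`, and the loop count are hypothetical, and the euro sign stands in for whatever high-codepoint characters the real message contains. Actual timings depend on the perl version and the input, so none are shown.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use Time::HiRes qw(gettimeofday tv_interval);

my %data = (TO => 'rcpt', ID => '7');

# Hypothetical stand-in for the downloaded message.
# "\x{20AC}" is a euro sign: one character, three octets in UTF-8.
my $line   = 'To: ${ TO }, id: ${ ID }, price: ' . "\x{20AC}5\n";
my $text   = $line x 200;              # text string, like LWP::Simple::get returns
my $octets = encode('UTF-8', $text);   # octet buffer, like `wget ...` returns

# Same substitution as in the question, applied to a copy of the input.
sub expand {
    my ($s) = @_;
    $s =~ s/\$\{ (\w+) \}/$data{$1}/g;
    return $s;
}

# Run the substitution repeatedly and return the elapsed time in ms.
sub time_ms {
    my ($s, $loops) = @_;
    my $start = [gettimeofday];
    expand($s) for 1 .. $loops;
    return 1000 * tv_interval($start);
}

printf "characters: %d, octets: %d\n", length($text), length($octets);
printf "text string:   %.1f ms\n", time_ms($text, 2000);
printf "octet buffer:  %.1f ms\n", time_ms($octets, 2000);

# Decoding the octets gives back the same text string, so the
# "wget" variant then behaves like the "get" variant.
my $decoded = decode('UTF-8', $octets);
print "decoded octets equal the text string\n" if $decoded eq $text;
```

    In the original script, the equivalent change would be to pass the `wget …` output through `decode` (or the get output through `encode`) before the timing loop and compare the two runs.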