I have recently noticed that a quick script I had written in Perl that was designed to be used on sub 10MB files has been modified, re-tasked and used in 40MB+ text files with significant performance issues in a batch environment.
The jobs have been running for about 12 hours per run when encountering a large text file and I am wondering how do I improve the perfomance of the code? Should I slurp the file into memory and if I do it will break the jobs reliance on the line numbers in the file. Any constructive thought would be greatly appreciated, I know the job is looping through the file too many times but how to reduce that?
#!/usr/bin/perl
use strict;
use warnings;
my $filename = "$ARGV[0]"; # This is needed for regular batch use
my $cancfile = "$ARGV[1]"; # This is needed for regular batch use
my @num =();
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
push (@num, $.) if (/^P\|/)
}
close FILE;
my $start;
my $end;
my $loop = scalar(@num);
my $counter =1;
my $test;
open (OUTCANC, ">>$cancfile") || error ("Could not open file: ($!)");
#Lets print out the letters minus the CANCEL letters
for ( 1 .. $loop )
{
$start = shift(@num) if ( ! $start );
$end = shift(@num);
my $next = $end;
$end--;
my $exclude = "FALSE";
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
my $line = $_;
$test = $. if ( eof );
if ( $. == $start && $line =~ /^P\|[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|1I\|IR\|/)
{
print OUTCANC "$line";
$exclude = "TRUECANC";
next;
}
if ( $. >= $start && $. <= $end && $exclude =~ "TRUECANC")
{
print OUTCANC "$line";
} elsif ( $. >= $start && $. <= $end && $exclude =~ "FALSE"){
print $_;
}
}
close FILE;
$end = ++$test if ( $end < $start );
$start = $next if ($next);
}
#Lets print the last letter in the file
my $exclude = "FALSE";
open(FILE, "<", "$filename") || error("Cannot open file ($!)");
while (<FILE>)
{
my $line = $_;
if ( $. == $start && $line =~ /^P\|[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]\|1I\|IR\|/)
{
$exclude = "TRUECANC";
next;
}
if ( $. >= $start && $. <= $end && $exclude =~ "TRUECANC")
{
print OUTCANC "$line";
} elsif ( $. >= $start && $. <= $end && $exclude =~ "FALSE"){
print $_;
}
}
close FILE;
close OUTCANC;
#----------------------------------------------------------------
sub message
{
my $m = shift or return;
print("$m\n");
}
sub error
{
my $e = shift || 'unknown error';
print("$0: $e\n");
exit 0;
}
There are some things that could speed the script up, like removing unneccessary regex usage.
/^P\|/
is equivalent to "P|" eq substr $_, 0, 2
.$foo =~ "BAR"
could be -1 != index $foo, "BAR"
.Then there is some repeated code. Factoring that out into a sub will not increase performance per se, but makes it easier to reason about the behaviour of the script.
There are a lot of unneccessary stringifications like "$filename"
– $filename
alone is sufficient.
But the worst offender would be this:
for ( 1 .. $loop ) {
...
open FILE, "<", $filename or ...
while (<FILE>) {
...
}
...
}
You only need to read that file in once, preferably into an array. You can the loop over the indices:
for ( 1 .. $loop ) {
...
for my $i (0 .. $#file_contents) {
my $line = $file_contents[$i];
... # swap $. for $i, but avoid off-by-one error
}
...
}
Disk IO is slow, so cache where you can!
I also see that you are using the $exclude
variable as a boolean with the values FALSE
and TRUECANC
. Why not 0
and 1
, so you can use it directly in a conditional?
You can factor out common tests in if/elsif:
if (FOO && BAR) { THING_A }
elsif (FOO && BAZ) { THING_B }
should be
if (FOO) {
if (BAR) { THING_A }
elsif (BAZ) { THING_B }
}
The $. == $start && $line =~ /^P\|.../
test may be silly, because $start
contains only the numbers of lines that start with P|
– so the regex may be sufficient here.
If I have understood the script correctly then the following should yield a significant performance increase:
#!/usr/bin/perl
use strict;
use warnings;
my ($filename, $cancfile) = @ARGV;
open my $fh, "<", $filename or die "$0: Couldn't open $filename: $!";
my (@num, @lines);
while (<$fh>)
{
push @lines, $_;
push @num, $#lines if "P|" eq substr $_, 0, 2;
}
open my $outcanc, ">>", $cancfile or die "$0: Couldn't open $cancfile: $!";
for my $i ( 0 .. $#num )
{
my $start = $num[$i];
my $end = ($num[$i+1] // @lines) - 1;
# pre v5.10:
# my $end = (defined $num[$i+1] ? $num[$i+1] : @lines) - 1
if ($lines[$start] =~ /^P[|][0-9]{9}[|]1I[|]IR[|]/) {
print {$outcanc} @lines[$start .. $end];
} else {
print STDOUT @lines[$start .. $end];
}
}
The script is cleaned up. The file is cached in an array. Only the parts of the array are iterated that are actually needed – we are down to O(n) from the previous O(n · m).
For your future scripts: Proving behaviour around loops and mutating variables is not impossible, but tedious and annoying. Realizing that
for (1 .. @num) {
$start = shift @num unless $next; # aka "do this only in the first iteration"
$next = shift @num:
$end = $next - 1:
while (<FH>) {
...
$test = $. if eof
...
}
$end = ++test if $end < $start;
$start = $next if $next;
}
is actually all about circumventing a possible undef
in the 2nd shift
takes some time. Instead of testing for eof
in the inner loop, we can just pick the line number after the loop, so we don't need $test
. Then we get:
$start = shift @num;
for my $i (1 .. @num) {
$end = $num[$i] - 1:
while (<FH>) { ... }
$end = $. + 1 if $end < $start; # $end < $start only true if not defined $num[$i]
$start = $num[$i] if $num[$i];
}
After translating $i
down by one we confine the out-of-bounds problem to one point only:
for my $i (0 .. $#num) {
$start = $num[$i];
$end = $num[$i+1] - 1; # HERE: $end = -1 if $i == $#num
while (<FH>) { ... }
}
$end = $. + 1 if $end < $start;
After replacing the file reading with an array (careful, there is a difference of one between the array index and the line number), we see that the final file reading loop can be avoided if we pull that iteration into the for
loop, because we know how many lines there are in total. So to say, we do
$end = ($num[$i+1] // $last_line_number) - 1;
Hopefully my cleaned up code is indeed equivalent to the original.