
How do I speed up pattern recognition in Perl?


This is the program as it stands right now. It takes in a .fasta file (a file containing genetic sequence data), creates a hash table from the data, and prints it. However, it is quite slow: it takes each 5-character substring and compares it against every other 5-character substring in the file.

use strict;
use warnings;
use Data::Dumper;

my $total = $#ARGV + 1;
my $row;
my $compare;
my %hash;
my $unique = 0;
open( my $f1, '<:encoding(UTF-8)', $ARGV[0] ) or die "Could not open file '$ARGV[0]' $!\n";

my $discard = <$f1>;
while ( $row = <$f1> ) {
    chomp $row;
    $compare .= $row;
}
my $size = length($compare);
close $f1;
for ( my $i = 0; $i < $size - 6; $i++ ) {
    my $vs = ( substr( $compare, $i, 5 ) );
    for ( my $j = 0; $j < $size - 6; $j++ ) {
        foreach my $value ( substr( $compare, $j, 5 ) ) {
            if ( $value eq $vs ) {
                if ( exists $hash{$value} ) {
                    $hash{$value} += 1;
                } else {
                    $hash{$value} = 1;
                }
            }
        }
    }
}
foreach my $val ( values %hash ) {
    if ( $val == 1 ) {
        $unique++;
    }
}

my $OUTFILE;
open $OUTFILE, ">output.txt" or die "Error opening output.txt: $!\n";
print {$OUTFILE} "Number of unique keys: " . $unique . "\n";
print {$OUTFILE} Dumper( \%hash );
close $OUTFILE;

Thanks in advance for any help!


Solution

  • It is not clear from the description what this script is meant to produce, but if you're looking for repeated runs of 5 characters, you don't need to do any string comparison at all. The nested loops compare every window against every other window, so the work grows with the square of the sequence length. Instead, you can run through the sequence once and keep a tally of how many times each 5-letter sequence occurs.

    use strict;
    use warnings;
    use Data::Dumper;
    
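    my $str = '';    # store the sequence here
    my %hash;
    # read the whole file into one string, line by line
    open( my $in, '<:encoding(UTF-8)', $ARGV[0] ) or die "Could not open file '$ARGV[0]' $!\n";
    my $header = <$in>;    # skip the FASTA header line, as the original script does
    while (<$in>) {
        chomp;
        $str .= $_;
    }
    close($in);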
    
    # not sure if you were deliberately stopping at $size - 6, which skips
    # the last two 5-letter windows; this loop covers the whole sequence
    my $l_size = length($str) - 4;
    for (my $i = 0; $i < $l_size; $i++) {
        $hash{ substr($str, $i, 5) }++;
    }
    
    # grep in a scalar context will count the values.
    my $unique = grep { $_ == 1 } values %hash;
    
    # three-argument open with a lexical filehandle
    open( my $out, '>', 'output.txt' ) or die "Error opening output.txt: $!\n";
    print {$out} "Number of unique keys: $unique\n";
    print {$out} Dumper( \%hash );
    close($out);
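
To see what the single-pass tally produces, here is a minimal sketch that runs it on a short in-memory string instead of a file. The helper name `count_kmers` and the sample sequence are made up for illustration and are not part of the original script; the window length is passed in as a parameter, assuming you may want something other than 5.

```perl
use strict;
use warnings;

# count_kmers is a hypothetical helper name used only for this illustration.
# It tallies every overlapping window of length $k in a single pass.
sub count_kmers {
    my ( $seq, $k ) = @_;
    my %count;
    # The last valid start position is length($seq) - $k; if the sequence
    # is shorter than $k the range is empty and nothing is counted.
    for my $i ( 0 .. length($seq) - $k ) {
        $count{ substr( $seq, $i, $k ) }++;
    }
    return \%count;
}

# Made-up sample sequence, 10 letters long, so it has 6 windows of length 5.
my $count  = count_kmers( 'ACGTACGTAC', 5 );
my $unique = grep { $_ == 1 } values %$count;

printf "%s => %d\n", $_, $count->{$_} for sort keys %$count;
print "Number of unique keys: $unique\n";
```

For this sample, ACGTA and CGTAC each occur twice while GTACG and TACGT occur once, so the script reports 2 unique keys. Each position in the sequence is visited once, which is why this scales linearly rather than quadratically.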