Search code examples
performanceperlparsingfastq

improve speed of parsing fastq


@solved C# with the same code is twice as fast

i am parsing a phred33 fastq file in perl and it is taking a considerable amount of time (on the order of 15 minutes). The fastq file is about 3 gigs. Are there any reasonable ways to make this faster?

$file=shift;
open(FILE,$file);
open(FILEFA,">".$file.".fa");
open(FILEQA,">".$file.".qual");
while($line=<FILE>)
{
    chomp($line);
    if($line=~m/^@/)
    {


    $header=$line;
    $header =~ s/@/>/g;
    $seq=<FILE>;
    chomp($seq);
    $nothing=<FILE>;
    $nothing="";
    $fastq=<FILE>;

    print FILEFA $header."\n";
    print FILEFA $seq."\n";
    $seq="";
    print FILEQA $header."\n";

        @elm=split("",$fastq);
        $i=0;
        while(defined($elm[$i]))
        {
            $Q = ord($elm[$i]) - 33;
            if($Q!="-23")
            {
            print FILEQA $Q." ";
            }
            $i=$i+1;
        }
        print FILEQA "\n";
    }
}
print $file.".fa\n";
print $file.".qual\n";

Solution

  • There's next to no CPU being used here. it's IO bound, so it's mostly the time to read through 3GB. There are micro-optimisations (and other cleanups) that can be done.

    First, always use use strict; use warnings;.

    The main code is

    my @elm = split(//, $fastq);
    my $i=0;
    while(defined($elm[$i])) {
        my $Q = ord($elm[$i]) - 33;
        if($Q!="-23") {
            print FILEQA $Q." ";
        }
        $i=$i+1;
    }
    

    The purpose of if($Q!="-23") is to check if the character is a newline, which you wouldn't have to do if you did chomp($fastq);. (What's with the quotes around -23?!)

    chomp($fastq);
    my @elm = split(//, $fastq);
    my $i=0;
    while(defined($elm[$i])) {
        my $Q = ord($elm[$i]) - 33;
        print FILEQA $Q." ";
        $i=$i+1;
    }
    print FILEQA "\n";
    

    Using a while loop just complicates things. Use a for loop when you have a known number of iterations.

    chomp($fastq);
    for (split(//, $fastq)) {
        print FILEQA (ord($_)-33)." ";
    }
    print FILEQA "\n";
    

    It might help a bit to turn that inside out.

    chomp($fastq);
    print FILEQA join(' ', map ord($_)-33, split //, $fastq), "\n";
    

    On second thought, not inside out enough :)

    $fastq =~ s/(.)/(ord($1)-33) . " "/eg;
    print FILEQA $fastq;
    

    But what it we precalculated the translations? Then we wouldn't have to call a sub (the /e code) repeatedly.

    my %map = map { chr($_) => ($_-33)." " } 0x00..0xFF;
    
    $fastq =~ s/(.)/$map{$1}/g;
    print FILEQA $fastq;
    

    After a bit more cleanup, we get:

    use strict;
    use warnings;
    
    my %map = map { chr($_) => ($_-33)." " } 0x00..0xFF;
    
    my $file = shift;
    
    my $fa_file   = "$file.fa";
    my $qual_file = "$file.qual";
    
    open(my $FILE,   '<', $file     ) or die $!;
    open(my $FILEFA, '>', $fa_file  ) or die $!;
    open(my $FILEQA, '>', $qual_file) or die $!;
    
    while (my $header = <$FILE>) {
        next if $header !~ /^@/;
    
        my $seq = <$FILE>;
        <$FILE>;
        my $fastq = <$FILE>;
    
        $header =~ s/@/>/g;
        $fastq =~ s/(.)/$map{$1}/g;
    
        print $FILEFA $header;
        print $FILEFA $seq;
    
        print $FILEQA $header;
        print $FILEQA $fastq;
    }
    
    print "$fa_file\n";
    print "$qual_file\n";