Search code examples
perlwhile-loopcontrol-flow

perl do something once in a while loop


I am frequently working with biological sequence data (FASTAs) that have the following format, where a leading left angle bracket is used as a delimiter to indicate a new sequence header. These files often have text wrapping (except for the headers):

>header1
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>header2
CTGGGACTAGGGGGAG
CTGGGACTAGGGGGAG

Often, I want to avoid reading the entire file into memory because it can be many MBs (sometimes GBs), so I try to focus on while loops and reading line by line. However, I find myself often needing to add extra code to do something unique at the top or bottom of the file. For instance, today I wanted to remove the text wrapping of some file, which seemed as simple as:

while (my $line = <$inputfasta_fh>) {
    chomp($line);
    if ($line =~ /^>/) {
        print $outputfasta_fh "$line\n";
    }
    else {
        print $outputfasta_fh $line;
    }
}

But, I realized I need a newline before all the headers except the first one (otherwise they would be concatenated to end of the the previous sequence). So, this is my crude work-around.

my $switch = 0;
while (my $line = <$inputfasta_fh>) {
    chomp($line);
    if ($line =~ /^>/) {
        if ($switch == 1) {
            print $outputfasta_fh "\n";
        }
        print $outputfasta_fh "$line\n";
        $switch = 1;
    }
    else {
        print $outputfasta_fh $line;
    }
}

Previously, I've had other issues where I need to do something with the last line. For example, I had a script that would read a fasta, store each header and then start counting its sequence length (again line-by-line), and if it was within a range I specified, I saved it to another file. The counting would abort if the length went past the maximum, but I wouldn't know if it was more than the minimum until I reached another header or the end of the file. In the latter case, I had to repeat the length-checking subroutine below the while loop. I would like to avoid repeating that last part.

my $length = 0;
my $header;
my @line_array;

while (my $line = <$inputfasta_fh>) {
    chomp($line);
    if ($line =~ /^>/) {
        # check if previous sequence had a length within range
        if (check_length($length, $minlength, $maxlength) == 1) {
            print $outputfasta_fh "$header\n";
            print $outputfasta_fh join ("\n", @line_array), "\n";
        }
        undef @line_array;
        $header = $line;
        $length = 0;
    }
    else {
        if ($length <= $maxlength) { # no point in measuring any more
            push (@linearray, $line);
            $length += length($line);
        }
    }
}

#and now for the last sequence
if (check_length($length, $minlength, $maxlength) == 1) {
    print $outputfasta_fh "$header\n";
    print $outputfasta_fh join ("\n", @line_array), "\n";
}

sub check_length {
    my ($length, $minlength, $maxlength) = @_;
    if (($length >= $minlength) && ($length <= $maxlength)) {
        return 1;
    }
    else {
        return 0;
    }
}

So, my basic question is how to indicate I want to do something once in a loop without resorting to counters or repeating code outside the loop? Thanks for any help!


Solution

  • Here are solutions to the 2 problems you described. They are solved using modules from the BioPerl distribution. In this case the Bio::SeqIO module to open files and the Bio::Seq module for some methods it provides, (length, width). You can see how they simplify the solutions!

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Bio::SeqIO;
    
    my $in  = Bio::SeqIO->new( -file   => "input1.txt" ,
                               -format => 'fasta');
    my $out = Bio::SeqIO->new( -file   => '>test.dat',
                               -format => 'fasta');
    
    while ( my $seq = $in->next_seq() ) {
        $out->width($seq->length); # sequence on 1 line.
        $out->write_seq($seq);
    }
    
    my ($minlen, $maxlen) = (40, 1000);
    
    while ( my $seq = $in->next_seq() ){
        my $len = $seq->length;
        out->write_seq($seq) if $minlen <= $len && $len <= $maxlen;
    }
    

    It would be worth your while to look into the modules - as you can see from these 2 examples, the resultant code is much more concise and easier to follow. You could look around the BioPerl wiki. The HOWTOs give some examples that you can use right away.