I am frequently working with biological sequence data (FASTAs) that have the following format, where a leading left angle bracket is used as a delimiter to indicate a new sequence header. These files often have text wrapping (except for the headers):
>header1
ACTGACTGACTGACTG
ACTGACTGACTGACTG
>header2
CTGGGACTAGGGGGAG
CTGGGACTAGGGGGAG
Often, I want to avoid reading the entire file into memory because it can be many MBs (sometimes GBs), so I try to focus on while loops and reading line by line. However, I find myself often needing to add extra code to do something unique at the top or bottom of the file. For instance, today I wanted to remove the text wrapping of some file, which seemed as simple as:
while (my $line = <$inputfasta_fh>) {
chomp($line);
if ($line =~ /^>/) {
print $outputfasta_fh "$line\n";
}
else {
print $outputfasta_fh $line;
}
}
But, I realized I need a newline before all the headers except the first one (otherwise they would be concatenated to end of the the previous sequence). So, this is my crude work-around.
my $switch = 0;
while (my $line = <$inputfasta_fh>) {
chomp($line);
if ($line =~ /^>/) {
if ($switch == 1) {
print $outputfasta_fh "\n";
}
print $outputfasta_fh "$line\n";
$switch = 1;
}
else {
print $outputfasta_fh $line;
}
}
Previously, I've had other issues where I need to do something with the last line. For example, I had a script that would read a fasta, store each header and then start counting its sequence length (again line-by-line), and if it was within a range I specified, I saved it to another file. The counting would abort if the length went past the maximum, but I wouldn't know if it was more than the minimum until I reached another header or the end of the file. In the latter case, I had to repeat the length-checking subroutine below the while loop. I would like to avoid repeating that last part.
my $length = 0;
my $header;
my @line_array;
while (my $line = <$inputfasta_fh>) {
chomp($line);
if ($line =~ /^>/) {
# check if previous sequence had a length within range
if (check_length($length, $minlength, $maxlength) == 1) {
print $outputfasta_fh "$header\n";
print $outputfasta_fh join ("\n", @line_array), "\n";
}
undef @line_array;
$header = $line;
$length = 0;
}
else {
if ($length <= $maxlength) { # no point in measuring any more
push (@linearray, $line);
$length += length($line);
}
}
}
#and now for the last sequence
if (check_length($length, $minlength, $maxlength) == 1) {
print $outputfasta_fh "$header\n";
print $outputfasta_fh join ("\n", @line_array), "\n";
}
sub check_length {
my ($length, $minlength, $maxlength) = @_;
if (($length >= $minlength) && ($length <= $maxlength)) {
return 1;
}
else {
return 0;
}
}
So, my basic question is how to indicate I want to do something once in a loop without resorting to counters or repeating code outside the loop? Thanks for any help!
Here are solutions to the 2 problems you described. They are solved using modules from the BioPerl distribution. In this case the Bio::SeqIO module to open files and the Bio::Seq module for some methods it provides, (length, width). You can see how they simplify the solutions!
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;
my $in = Bio::SeqIO->new( -file => "input1.txt" ,
-format => 'fasta');
my $out = Bio::SeqIO->new( -file => '>test.dat',
-format => 'fasta');
while ( my $seq = $in->next_seq() ) {
$out->width($seq->length); # sequence on 1 line.
$out->write_seq($seq);
}
my ($minlen, $maxlen) = (40, 1000);
while ( my $seq = $in->next_seq() ){
my $len = $seq->length;
out->write_seq($seq) if $minlen <= $len && $len <= $maxlen;
}
It would be worth your while to look into the modules - as you can see from these 2 examples, the resultant code is much more concise and easier to follow. You could look around the BioPerl wiki. The HOWTOs give some examples that you can use right away.