Search code examples
perlbioinformaticsfasta

How do I merge two FASTA files (one file with line break) in Perl?


I have two following Fasta file:

file1.fasta

>0
GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT
>1
GTTAAGTTATATCAAACTAAATATACATACTATAAA
>2
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC

file2.qual

>0
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40
40 40 40 40 40 40 40 40 15 40 40
>1
40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40
40 40 40 40 40 40 40 40 40 40 40
>2
40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4
40 8 3 29 10 19 18 40 19 15 5

Note the line break in "qual" file for each fasta header - marked with ">". Number of file header ('>') is the same for both files. Number of numerical qualities = sequence length.

What I want to do is to append this two files yielding:

GAATAGATGTTTCAAATGTACCAATTTCTTTCGATT  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 15 40 40
GTTAAGTTATATCAAACTAAATATACATACTATAAA  40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 40 20 40 40 40 40 40 40 40 40 40 40 40 40 40 40
GGGGCTGTGGATAAAGATAATTCCGGGTTCGAATAC  40 40 40 40 7 40 40 5 40 40 40 40 40 40 40 40 37 13 31 20 15 40 10 11 4 40 8 3 29 10 19 18 40 19 15 5

But somehow my code below fail to do it correctly? Especially the second line of each entry in 'qual' file doesn't get printed.

use strict;
use Data::Dumper;        
use Carp;
use File::Basename;      

my $fastafile = $ARGV[0] || "reads/2039F.2.fasta"; 
my $base      = basename( $fastafile, ".fasta" );
my $qualfile  = "reads/" . $base . ".qual";
print "$qualfile\n";

open SEQ, '<', $fastafile or die $!; #Seq
open PRB, '<', $qualfile or die $!; #quality


while (my $seq = <SEQ>) {
     my $qual = <PRB>;
     chomp($seq);
     chomp($qual);

     if ($seq =~ /^>/ || $qual =~ /^>/) {
         next;
     }
     else {
         print "$seq\t$qual\n";      
     }

}

What's the correct way to do it?


Solution

  • You're missing the 2nd (and every subsequent) line of the quality scores and would also miss additional sequence lines. For this and code re-use purposes, the way to handle FASTA sequences is as whole entries/records:

    local $/ = "\n>";
    while (my $seq = <SEQ>) {
         my $qual = <PRB>;
         chomp($seq);  $seq =~ s/^>*.+\n//;  $seq =~ s/\n//g;
         chomp($qual);  $qual =~ s/^>*.+\n//;  $qual =~ s/\n/ /g;
    
         print "$seq\t$qual\n";      
    
    }
    

    You could also easily capture the FASTA header in the first replace.