Search code examples
perlfasta

Get the header lines of protein sequences that start with specific amino acid in FASTA


Hi guys so I have been trying to use PERL to print only the headers (the entire >gi line) of protein sequences that start with "MAD" or "MAN" (the first 3 aa) from a FASTA file. But I couldn't figure out which part went wrong. Thanks in advance!

#!usr/bin/perl
use strict;

my $in_file = $ARGV[0];
open( my $FH_IN, "<", $in_file );    ###open to fileholder
my @lines = <$FH_IN>;
chomp @lines;
my $index = 0;

foreach my $line (@lines) {
    $index++;
    if ( substr( $line, 0, 3 ) eq "MAD" or substr( $line, 0, 3 ) eq "MAN" ) {
        print "@lines [$index-1]\n\n";
    } else {
        next;
    }
}

This is a short part of the FASTA file, the header of the first seq is what I am looking for

>gi|16128078|ref|NP_414627.1| UDP-N-acetylmuramoyl-L-alanyl-D-glutamate:meso-diaminopimelate ligase [Escherichia coli str. K-12 substr. MG1655] MADRNLRDLLAPWVPDAPSRALREMTLDSRVAAAGDLFVAVVGHQADGRRYIPQAIAQGVAAIIAEAKDE ATDGEIREMHGVPVIYLSQLNERLSALAGRFYHEPSDNLRLVGVTGTNGKTTTTQLLAQWSQLLGEISAV MGTVGNGLLGKVIPTENTTGSAVDVQHELAGLVDQGATFCAMEVSSHGLVQHRVAALKFAASVFTNLSRD HLDYHGDMEHYEAAKWLLYSEHHCGQAIINADDEVGRRWLAKLPDAVAVSMEDHINPNCHGRWLKATEVN

Solution

  • Since you want to know how to improve your code, here is a commented version of your program with some suggestions on how you could change it.

    #!/usr/bin/perl
    use strict;
    

    You should also add the use warnings pragma, which enables warnings (as you might expect).

    my $in_file = $ARGV[0];
    

    It's a good idea to check that $ARGV[0] is defined, and to give an appropriate error message if it isn't, e.g.

    my $in_file = $ARGV[0] or die "Please supply the name of the FASTA file to process";
    

    If $ARGV[0] is not defined, Perl executes the die statement.

    open( my $FH_IN, "<", $in_file );  # open to fileholder
    

    You should check that the script is able to open the input file; you can use a similar structure to the previous statement, by adding a die statement:

    open( my $FH_IN, "<", $in_file ) or die "Could not open $in_file: $!";
    

    The special variable $! holds the error message as to why the file could not be opened (e.g. it doesn't exist, no permission to read it, etc.).

    my @lines = <$FH_IN>;
    chomp @lines;
    my $index = 0;
    
    foreach my $line (@lines) {
        $index++;
        if ( substr( $line, 0, 3 ) eq "MAD" or substr( $line, 0, 3 ) eq "MAN" ) {
             print "@lines [$index-1]\n\n";
    

    This is the problem point in the script. Firstly, the correct way to access an item in the array is using $lines[$index-1]. Secondly, the first item in an array is at index 0, so line 1 of the file will be at position 0 in @lines, line 4 at position 3, etc. Because you've already incremented the index, you're printing the line after the header line. The problem can easily be fixed by incrementing $index at the end of the loop.

        }
        else {
           next;
        }
    

    Using next isn't really necessary here because there is no code following the else statement, so there's nothing to gain from telling Perl to skip the rest of the loop.

    The fixed code would look like this:

    #!/usr/bin/perl
    use warnings;
    use strict;
    
    my $in_file = $ARGV[0] or die "Please supply the name of the FASTA file to be processed";
    open( my $FH_IN, "<", $in_file ) or die "Could not open $in_file: $!";
    my @lines = <$FH_IN>;
    chomp @lines;
    
    my $index = 0;
    foreach my $line (@lines) {
        if ( substr( $line, 0, 3 ) eq "MAD" or substr( $line, 0, 3 ) eq "MAN" ) {
            print "$lines[$index-1]\n\n";
        }
        $index++;
    }
    

    I hope that is helpful and clear!