
How can I find a protein sequence from a FASTA file using Perl?


I have an exercise in which I have to print the first three lines of a FASTA file as well as the protein sequence. I tried to run a script I wrote, but Cygwin doesn't seem to print the sequence. My code is as follows:

#!/usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
    if($_=~ m/^ID/) {
        print $_ ;
    }
    if($_=~ m/^AC/) {
        print $_ ;
    }
    if ($_=~ m/^SQ/) {
        print $_;
    }
    if ($_=~ m/\^s+(\w+)/) { #this is the part I have trouble with
        $a.=$1;
        $a=~s/\s//g; #this is for removing the spaces inside the sequence
        print $a;
    }
}

The FASTA file looks like this:

SQ   SEQUENCE   474 AA;  55345 MW;  0D9FA81230B282D9 CRC64;
     MRFTFTSRCL ALFLLLNHPT PILPAFSNQT YPTIEPKPFL YVVGRKKMMD AQYKCYDRMQ
     QLPAYQGEGP YCNRTWDGWL CWDDTPAGVL SYQFCPDYFP DFDPSEKVTK YCDEKGVWFK
     HPENNRTWSN YTMCNAFTPE KLKNAYVLYY LAIVGHSLSI FTLVISLGIF VFFRSLGCQR
     VTLHKNMFLT YILNSMIIII HLVEVVPNGE LVRRDPVSCK ILHFFHQYMM ACNYFWMLCE
     GIYLHTLIVV AVFTEKQRLR WYYLLGWGFP LVPTTIHAIT RAVYFNDNCW LSVETHLLYI
     IHGPVMAALV VNFFFLLNIV RVLVTKMRET HEAESHMYLK AVKATMILVP LLGIQFVVFP
     WRPSNKMLGK IYDYVMHSLI HFQGFFVATI YCFCNNEVQT TVKRQWAQFK IQWNQRWGRR
     PSNRSARAAA AAAEAGDIPI YICHQELRNE PANNQGEESA EIIPLNIIEQ ESSA
//

To match the sequence, I used the fact that each sequence line starts with several spaces and then contains only letters. That doesn't seem to work in Cygwin. Here is the link to the sequence: https://www.uniprot.org/uniprot/P30988.txt


Solution

  • The problem is with this line:

    if ($_=~ m/\^s+(\w+)/) { #this is the part I have trouble with
    

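    The backslash is in the wrong place in \^s+. As written, \^ escapes the caret, so the pattern looks for a literal ^ character followed by one or more s characters, rather than whitespace at the start of the line. The line in your code should be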

    if ($_=~ m/^\s+(\w+)/) { #this is the part I have trouble with
    

    I'd write that block of code like this

    if ($_=~ m/^\s/) { 
        s/\s+//g; # remove all whitespace, including the trailing newline, so the sequence prints as one continuous string
        print $_;
    }
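
For reference, here is a minimal sketch of how the whole script might look with the corrected pattern. It is one possible arrangement, not the only way to do it. The helper name extract_entry is hypothetical, the ID and AC lines in the sample are made-up placeholders, and the SQ block is truncated from the entry in the question so that the example runs without the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the original post): reads a UniProt-style
# entry from a filehandle and returns the ID/AC/SQ header lines together
# with the sequence joined into a single string.
sub extract_entry {
    my ($fh) = @_;
    my @headers;
    my $sequence = '';
    while (my $line = <$fh>) {
        if ($line =~ /^(?:ID|AC|SQ)\s/) {
            # Header lines start with a two-letter code in column 1.
            push @headers, $line;
        }
        elsif ($line =~ /^\s+\S/) {
            # Sequence lines start with whitespace; remove all of it,
            # including the trailing newline, before appending.
            $line =~ s/\s+//g;
            $sequence .= $line;
        }
    }
    return (\@headers, $sequence);
}

# Inline sample so the sketch runs without the real file. The ID and AC
# lines are made-up placeholders; the SQ block is truncated from the question.
my $sample = <<'END';
ID   EXAMPLE_ID
AC   X00000;
SQ   SEQUENCE   474 AA;  55345 MW;  0D9FA81230B282D9 CRC64;
     MRFTFTSRCL ALFLLLNHPT PILPAFSNQT YPTIEPKPFL YVVGRKKMMD AQYKCYDRMQ
     PSNRSARAAA AAAEAGDIPI YICHQELRNE PANNQGEESA EIIPLNIIEQ ESSA
//
END

# Open the string as an in-memory filehandle; for the real file, use
# open(my $fh, '<', 'P30988.txt') instead.
open(my $fh, '<', \$sample) or die "Cannot open sample: $!";
my ($headers, $sequence) = extract_entry($fh);
close $fh;

print @$headers;
print "$sequence\n";
```

Collecting the sequence in a variable and printing it once after the loop avoids the repeated partial output that print $a inside the loop would produce. The three-argument open with an "or die" check also reports a missing file instead of silently printing nothing.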