Search code examples
unixawkgrepextractfasta

Extract FASTA sequences (with version number) using sequence IDs (without version number) listed in txt file


I would like to extract specific sequences from myfile.fasta based on the ids listed in transcript_id.txt file. My main problem is that my transcript_id.txt file only lists transcripts ids while fasta file also has transcript versions and transcripts listed in transcript_id.txt can have multiple versions in fasta file. I have tried several approach (listed below) but couldn't get what I need.

myfile.fasta

>transcriptA.t1
ATGGAAGTAGGAAGTGGTTTGGGGAATGATGAACGACGCACCTGGCCAGCCGATGTTCTA
GCAGCAGCAGATGCCTACGTTGTCCTCGCTGCAGAGTACAACCACAGCCTTCCCCCGGCA
AATCAAAGCATCCTAAATGAAGCATGTGTGACAGCAGGATCATCTGCCAGGGTTTCCAAA
CTCACCAACCTAATGGACCACTTCTTCTCGTACAAGTATCGTCCATCTGGCATTGTCTGC
TTGGCTGTTGAGACTTATGTCTGCTGGGGAGCTGAAGCACGCCTACATCTTTCACGCCTT
>transcriptA.t2
GTGCCAAATTGGGATCTGGAGAAACCTGCAGCTTTTGATCTTACTGTTGCATCACTGCTT
TTCATCAAGAAG
>transcriptB.t1
GGAGTGGTGAGTCTTCTGCTCACTGCACTAAACCTTAATGACACTGGGACCTACAGCTGC
GTTGATGTGCCTCAATCGATTGATGAATTTGCTCGGCGTCATCCTCGGCGATTGCAATTA
GTAGATATTCTTAACGATTGA
>transcriptC.t1
AAAAGGCTCTGGGAGTTTGAGGCCAACGGGGGGGGAGGCCCCATTACCTCCAGCGTCAGG
TTCGCTGATGTGTACAACGATGGAACCCTAGACGTGATCTTTGTCACCCTCACCGGAACC
TTCTGGGTCCTGGAGGGGCTCACTGGA
>transcriptD.t1
CAAAGGAAGCATGCCTCTAATGATGCTAAGTGCTCAGAGCTAGGTTGGTCATGCATACCA
GCCATGGGAGACCCGTCCCCATCCATCCAATGGAGCTCCACTCCGCAACTTTAG
>transcriptD.t2
ATGCCTCAGGTGAATGTGGCCCCACCCACTGCCAAGGTCAAGGGGGCGTGTAGGGTTGTA
TGTGGTGCGTTGTCTCTCCAACAATTCATTATGCCCGACCAAGAGGTGTCACCTGTTCAG
CAAGGAGAATCTGACCATTTGCACATTGAAGCTTTCACTCTGGTGTCTGGAGGCACGAGT
ACGGATGTCGTAACTTTGCTGCAAGAGCAATACAAAGTGCTAAGCTGA
>transcriptE.t1
TCTATTCCAGTAGTCTACAAGGCACTGACCCCTGAGGGAGTGCCATCCAACAAGGAAGAT
GTCATTGGAGTGGTGCCAGACAAGGTTGTTGGAACACCAGATGTGAAGCCAACTGGAAGT
GTAGCTGCTGCTGCTGCCTTGGGCGTGTGCAAAGTGACTCT

transcript_id.txt

transcriptA
transcriptC
transcriptD

Goal

>transcriptA.t1
ATGGAAGTAGGAAGTGGTTTGGGGAATGATGAACGACGCACCTGGCCAGCCGATGTTCTA
GCAGCAGCAGATGCCTACGTTGTCCTCGCTGCAGAGTACAACCACAGCCTTCCCCCGGCA
AATCAAAGCATCCTAAATGAAGCATGTGTGACAGCAGGATCATCTGCCAGGGTTTCCAAA
CTCACCAACCTAATGGACCACTTCTTCTCGTACAAGTATCGTCCATCTGGCATTGTCTGC
TTGGCTGTTGAGACTTATGTCTGCTGGGGAGCTGAAGCACGCCTACATCTTTCACGCCTT
>transcriptA.t2
GTGCCAAATTGGGATCTGGAGAAACCTGCAGCTTTTGATCTTACTGTTGCATCACTGCTT
TTCATCAAGAAG
>transcriptC.t1
AAAAGGCTCTGGGAGTTTGAGGCCAACGGGGGGGGAGGCCCCATTACCTCCAGCGTCAGG
TTCGCTGATGTGTACAACGATGGAACCCTAGACGTGATCTTTGTCACCCTCACCGGAACC
TTCTGGGTCCTGGAGGGGCTCACTGGA
>transcriptD.t1
CAAAGGAAGCATGCCTCTAATGATGCTAAGTGCTCAGAGCTAGGTTGGTCATGCATACCA
GCCATGGGAGACCCGTCCCCATCCATCCAATGGAGCTCCACTCCGCAACTTTAG
>transcriptD.t2
ATGCCTCAGGTGAATGTGGCCCCACCCACTGCCAAGGTCAAGGGGGCGTGTAGGGTTGTA
TGTGGTGCGTTGTCTCTCCAACAATTCATTATGCCCGACCAAGAGGTGTCACCTGTTCAG
CAAGGAGAATCTGACCATTTGCACATTGAAGCTTTCACTCTGGTGTCTGGAGGCACGAGT
ACGGATGTCGTAACTTTGCTGCAAGAGCAATACAAAGTGCTAAGCTGA

Tried:

https://bioinformaticsworkbook.org/dataWrangling/fastaq-manipulations/retrieve-fasta-sequences-using-sequence-ids.html#gsc.tab=0

1) ncbi-blast/makeblastdb

makeblastdb -in myfile.fasta -parse_seqids -dbtype nucl
blastdbcmd -entry "lcl|transcriptA.t1" -db myfile.fasta -target_only

This partially worked but I could only search it by typing in exact transcript version and adding "lcl|". Didn't manage to use wildcard or transcript_id.txt.

https://www.biostars.org/p/319099/

2) grep

grep -w -A 2 -f transcript_id.txt myfile.fasta --no-group-separator

This works great but I have to set -A to some number and numbers of lines each transcript has varies.

3) Linearising fasta file

awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < myfile.fasta > linear.fasta
awk '/^>transcriptA/' linear.fasta

Again I don't know how to search linearised fasta file with transcript_id.txt only one by one using awk.

4) seqkit

This only worked if I *add transcript version to transcript_id.txt. Any attempt to use --id-regexp failed.

seqkit grep -n -f transcript_id*.txt myfile.fasta

Solution

  • 1st solution: Could you please try following. Written and tested with shown samples in GNU awk.

    awk '
    FNR==NR{
      arr[$0]
      next
    }
    /^>/{
      found=""
      match($0,/^>[^.]*/)
      if(substr($0,RSTART+1,RLENGTH-1) in arr){
        found=1
      }
    }
    found
    ' transcript_id.txt myfile.fasta
    

    2nd solution: With using multiple field delimiters option.

    awk '
    FNR==NR{
      arr[$0]
      next
    }
    /^>/{
      found=""
      if($2 in arr){ found=1 }
    }
    found
    ' transcript_id.txt FS="[>.]" myfile.fasta
    


    Explanation of 1st solution:

    awk '                              ##Starting awk program from here.
    FNR==NR{                           ##Checking condition FNR==NR which will be TRUE when transcript_id.txt file is being read.
      arr[$0]                          ##Creating arr with index of current line.
      next                             ##next will skip all further statements from here.
    }
    /^>/{                              ##Checking condition if line starts from > then do following.
      found=""                         ##Nullifying found here.
      match($0,/^>[^.]*/)              ##using match function to match regex starting from > to before dot occured in current line.
      if(substr($0,RSTART+1,RLENGTH-1) in arr){ ##Checking condition if sub string which we get after above succesul matched regex is present in arr
        found=1                        ##Then setting found to 1 here.
      }
    }
    found                              ##Checking condition if found is SET then print that line.
    ' transcript_id.txt myfile.fasta   ##Mentioning Input_file names here.
    

    Explanation of 2nd solution:

    awk '                         ##Starting awk program from here.
    FNR==NR{                      ##Checking condition FNR==NR which will be TRUE when transcript_id.txt is being read.
      arr[$0]                     ##Creating arr with index of current line here.
      next                        ##next will skip all further statements from here.
    }
    /^>/{                         ##Checking condition if line starts from > then do following.
      found=""                    ##Nulifying found here.
      if($2 in arr){ found=1 }    ##Checking condition if 2nd field is present in arr then set found.
    }
    found                         ##Checking condition if found is set then print line.
    ' transcript_id.txt FS="[>.]" myfile.fasta  ##Mentioning Input_file names and setting FS as > or dot for Input_file myfile.fasta here.