Search code examples
bashfilterbioinformaticsfasta

Filter a .fasta file


I have the following two files:

  1. A .fasta file with >700 protein sequences. Each sequence occupies a few lines, and has a header with some information about the protein, including an accession code (e.g. A0A2K5BUY4). Here is an extract from the file:
>A0A2K5RDW4|unreviewed|Olfactory receptor|taxID:2715852
MAHINCTQVAEFILVGLTDHQELKMPLFVLFLSIYLFTVVGNLGLILLIRADSRLSTPMYFFLSNLAFVDFCYSSVITPK
MLGTFLYKQNVISFNACATQLCCFLTFMISECLLLASIAYDRYVAICNPLLYMVVMSPGICIQLVAVPYSYSFLMALFHT
ILTFRLSYCHSNIVNHFYCDDMPLLRLTCSDTSSKQLWIFACAGIMFTSSLLIVFVSYMFIISAILRMRSAEGRQKAFST
CGSHMLAVTIFYGTLIFMYLQPSSNHALDTDKMASVFYTVIIPMLNPLIYSLRNKEVKEALKKVIINKNQ
>A0A2K5RFA5|unreviewed|Olfactory receptor|taxID:2715852
MSRWSNGSSNVSYTSFLLAGFPGLRESRALLVLPFLSLYLMIIFTNALVIHTVLTQQSLHQPMYLLIALLLAVNICATST
VVPAMLFSFSTHFNCISLARCLIQMFCIYFLIVFDCNILLVMALDRYVAICYPLCYPEIVTGQLLAGLVGAAATRSTCIV
APVVFLASRVHFCRSDMIQHFACEHMALMKLSCGDISLNKTVGLTVRIFNRVLDMLLLGTSYSHIIHAAFRISSGRARSK
ALNTCGSHLLVIFTVYLSTMSSSIVYRTAHTASQDVHNLLSTFYLLLPCLVNPIIYGTRTKEIRYHLGEQFLSSGPRDTM
MSI
  1. A .txt file with 229 lines, each one containing one accession code, like the ones in the header of the sequences in the .fasta file. Here is an extract from this file:
A0A2K5BUY4
A0A2K5BUY7
A0A2K5BUZ1
A0A2K5BW26
A0A2K5BW36
A0A2K5BWA0

I want to create a second .fasta file, that contains only the sequences (including headers) whose accession codes are in the .txt file.

I am new to bioinformatics and to programming in general, so I did the following with help from chatGPT:

# Use pcregrep to extract sequences that match the accession codes
pcregrep -M -f platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta | \
  pcregrep -M ">([^>]+)(\n[^>]+)*" > filtered_platyrrhini_ORs.fasta

but this did not catch the sequences, as the output contained only the headers of the matching proteins:

>A0A2R8PLQ0|unreviewed|Olfactory receptor|taxID:9483
>A0A5F4VRG8|unreviewed|Olfactory receptor|taxID:9483
>A0A5F4VRT9|unreviewed|Olfactory receptor|taxID:9483
>A0A5F4VS80|unreviewed|Olfactory receptor|taxID:9483

I also tried with grep, but it didn't work because the sequences do not occupy a fixed number of lines:

grep -A 1 -f platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta > filtered_platyrrhini_ORs.fasta

This captured the 229 headers but only 1 line from each sequence.

And lastly I used awk:

awk 'BEGIN {FS="|"} NR==FNR {accessions[$1]; next} 
     {if (substr($0, 1, 1) == ">") {seq_id = substr($0, 2); seq = $0} 
      else {seq = seq "\n" $0} 
      if (seq_id in accessions) {print seq_id "\n" seq}}' \
platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta > filtered_platyrrhini_ORs.fasta

and the output file was empty.

I would really appreciate any help.


Solution

  • Using awk for this is a good idea but there are several issues with your code:

    • With seq_id = substr($0, 2) you assign not just the ID but the whole header minus the leading >.

    • With accessions[$1] the index you add to array accessions is the ID without the leading >. But the value of seq_id in seq_id in accessions has the leading >.

    • You decide to print while processing the header (inside the if (substr($0, 1, 1) == ">" {...}), so the rest of the sequence is not yet in your seq variable.

    • ...

    If the header lines always start with a > and no other lines start with a >, you can try:

    awk -F '|' '
      NR == FNR {accessions[">" $1]; next}
      /^>/ {to_print = $1 in accessions}
      to_print' platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta
    

    awk statements are pattern-action pairs applied to the input records, where the action is executed if and only if the pattern is true for the current record. In your case records are regular lines.

    • The pattern of the first statement (NR == FNR) is true only for the lines of the first file. Its action (accessions[">" $1]; next) stores the ID found in platyrrhini_or_accessions.txt with an added leading > as index of the accessions array and moves to the next line (next), such that the 2 other statements are skipped.

    • The pattern of the second statement (/^>/) is a regular expression and evaluates as true for all lines with a leading >. The action (to_print = $1 in accessions) assigns true to variable to_print if the first field of the line, that is, the ID with the leading >, is an index of the accessions array, else false.

    • The third statement has only the pattern part (to_print), no action part. So it applies the default action (print) to the current line if to_print is true, that is, if the last ID we encountered is in platyrrhini_or_accessions.txt.

    Note: in awk there is no real boolean type; 0 and the empty string are considered as false, anything else as true. With any POSIX-compliant awk, in the above code, the to_print variable takes value 0 (false) or a non-zero numeric value (true), likely 1 (but this is not specified by POSIX).