I have the following two files:
>A0A2K5RDW4|unreviewed|Olfactory receptor|taxID:2715852
MAHINCTQVAEFILVGLTDHQELKMPLFVLFLSIYLFTVVGNLGLILLIRADSRLSTPMYFFLSNLAFVDFCYSSVITPK
MLGTFLYKQNVISFNACATQLCCFLTFMISECLLLASIAYDRYVAICNPLLYMVVMSPGICIQLVAVPYSYSFLMALFHT
ILTFRLSYCHSNIVNHFYCDDMPLLRLTCSDTSSKQLWIFACAGIMFTSSLLIVFVSYMFIISAILRMRSAEGRQKAFST
CGSHMLAVTIFYGTLIFMYLQPSSNHALDTDKMASVFYTVIIPMLNPLIYSLRNKEVKEALKKVIINKNQ
>A0A2K5RFA5|unreviewed|Olfactory receptor|taxID:2715852
MSRWSNGSSNVSYTSFLLAGFPGLRESRALLVLPFLSLYLMIIFTNALVIHTVLTQQSLHQPMYLLIALLLAVNICATST
VVPAMLFSFSTHFNCISLARCLIQMFCIYFLIVFDCNILLVMALDRYVAICYPLCYPEIVTGQLLAGLVGAAATRSTCIV
APVVFLASRVHFCRSDMIQHFACEHMALMKLSCGDISLNKTVGLTVRIFNRVLDMLLLGTSYSHIIHAAFRISSGRARSK
ALNTCGSHLLVIFTVYLSTMSSSIVYRTAHTASQDVHNLLSTFYLLLPCLVNPIIYGTRTKEIRYHLGEQFLSSGPRDTM
MSI
A0A2K5BUY4
A0A2K5BUY7
A0A2K5BUZ1
A0A2K5BW26
A0A2K5BW36
A0A2K5BWA0
I want to create a second .fasta file, that contains only the sequences (including headers) whose accession codes are in the .txt file.
I am new to bioinformatics and to programming in general, so I did the following with help from chatGPT:
# Use pcregrep to extract sequences that match the accession codes
pcregrep -M -f platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta | \
pcregrep -M ">([^>]+)(\n[^>]+)*" > filtered_platyrrhini_ORs.fasta
but this did not catch the sequences, as the output contained only the headers of the matching proteins:
>A0A2R8PLQ0|unreviewed|Olfactory receptor|taxID:9483
>A0A5F4VRG8|unreviewed|Olfactory receptor|taxID:9483
>A0A5F4VRT9|unreviewed|Olfactory receptor|taxID:9483
>A0A5F4VS80|unreviewed|Olfactory receptor|taxID:9483
I also tried with grep, but it didn't work because the sequences do not occupy a fixed number of lines:
grep -A 1 -f platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta > filtered_platyrrhini_ORs.fasta
This captured the 229 headers but only 1 line from each sequence.
And lastly I used awk:
awk 'BEGIN {FS="|"} NR==FNR {accessions[$1]; next}
{if (substr($0, 1, 1) == ">") {seq_id = substr($0, 2); seq = $0}
else {seq = seq "\n" $0}
if (seq_id in accessions) {print seq_id "\n" seq}}' \
platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta > filtered_platyrrhini_ORs.fasta
and the output file was empty.
I would really appreciate any help.
Using awk
for this is a good idea but there are several issues with your code:
With seq_id = substr($0, 2)
you assign not just the ID but the whole header minus the leading >
.
With accessions[$1]
the index you add to array accessions
is the ID without the leading >
. But the value of seq_id
in seq_id in accessions
has the leading >
.
You decide to print while processing the header (inside the if (substr($0, 1, 1) == ">" {...}
), so the rest of the sequence is not yet in your seq
variable.
...
If the header lines always start with a >
and no other lines start with a >
, you can try:
awk -F '|' '
NR == FNR {accessions[">" $1]; next}
/^>/ {to_print = $1 in accessions}
to_print' platyrrhini_or_accessions.txt platyrrhini_784ORs.fasta
awk
statements are pattern-action pairs applied to the input records, where the action is executed if and only if the pattern is true for the current record. In your case records are regular lines.
The pattern of the first statement (NR == FNR
) is true only for the lines of the first file. Its action (accessions[">" $1]; next
) stores the ID found in platyrrhini_or_accessions.txt
with an added leading >
as index of the accessions
array and moves to the next line (next
), such that the 2 other statements are skipped.
The pattern of the second statement (/^>/
) is a regular expression and evaluates as true for all lines with a leading >
. The action (to_print = $1 in accessions
) assigns true to variable to_print
if the first field of the line, that is, the ID with the leading >
, is an index of the accessions
array, else false.
The third statement has only the pattern part (to_print
), no action part. So it applies the default action (print) to the current line if to_print
is true, that is, if the last ID we encountered is in platyrrhini_or_accessions.txt
.
Note: in awk
there is no real boolean type; 0 and the empty string are considered as false, anything else as true. With any POSIX-compliant awk
, in the above code, the to_print
variable takes value 0 (false) or a non-zero numeric value (true), likely 1 (but this is not specified by POSIX).