My .fasta file consists of this repeating pattern.
>sp|P20855|HBB_CTEGU Hemoglobin subunit beta OS=Ctenodactylus gundi OX=10166 GN=HBB PE=1 SV=1
asdfaasdfaasdfasdfa
>sp|Q00812|TRHBN_NOSCO Group 1 truncated hemoglobin GlbN OS=Nostoc commune OX=1178 GN=glbN PE=3 SV=1
asdfadfasdfaasdfasdfasdfasd
>sp|P02197|MYG_CHICK Myoglobin OS=Gallus gallus OX=9031 GN=MB PE=1 SV=4
aafdsdfasdfasdfa
I want to filter out only the lines that contain '>' THEN filter out the string after 'OS=' and before 'OX=', (example line1=Ctenodactylus gundi)
The first part('>') is easy enough:
grep '>' my.fasta | cut -d " " -f 3 >> species.txt
The problem is that the number of fields is not constant BEFORE 'OS='.
But the number of column/fields between 'OS=' and 'OX=' is 2.
You can use the -P
option to enable PCRE-based regex matching, and use lookaround patterns to ensure that the match is enclosed between OS=
and OX=
:
grep '>' my.fasta | grep -oP '(?<=OS=).*(?=OX=)'
Note that the -P
option is available only to the GNU's version of grep
, which may not be available by default in some environments.