regex awk sed grep cut

Filter (or 'cut') out column that begins with 'OS=abc'


My .fasta file consists of this repeating pattern.

>sp|P20855|HBB_CTEGU Hemoglobin subunit beta OS=Ctenodactylus gundi OX=10166 GN=HBB PE=1 SV=1
asdfaasdfaasdfasdfa
>sp|Q00812|TRHBN_NOSCO Group 1 truncated hemoglobin GlbN OS=Nostoc commune OX=1178 GN=glbN PE=3 SV=1
asdfadfasdfaasdfasdfasdfasd
>sp|P02197|MYG_CHICK Myoglobin OS=Gallus gallus OX=9031 GN=MB PE=1 SV=4
aafdsdfasdfasdfa

I want to keep only the lines that contain '>', then extract the string after 'OS=' and before 'OX=' (for example, line 1 should give: Ctenodactylus gundi).

The first part ('>') is easy enough:

grep '>' my.fasta | cut -d " " -f 3 >> species.txt

The problem is that the number of fields is not constant BEFORE 'OS='.

But the number of columns/fields between 'OS=' and 'OX=' is 2.


Solution

  • You can use the -P option to enable PCRE-based regex matching, together with -o to print only the matched text, and use lookaround patterns so that the match covers only what follows OS= and precedes the space before OX=:

    grep '>' my.fasta | grep -oP '(?<=OS=).*?(?= OX=)'
    

    Note that the -P option is available only in GNU grep, which may not be installed by default in some environments.
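
Where GNU grep is not available, the same extraction can be sketched with sed, which needs no PCRE support. This is a minimal sketch, not part of the original answer: the function name extract_species is a hypothetical helper, and it assumes every header line starts with '>' and contains ' OS=' followed later by ' OX=', as in the sample above.

```shell
# extract_species is a hypothetical helper name, not from the original answer.
# It prints the text between "OS=" and " OX=" for each FASTA header line.
# Reads the files given as arguments, or standard input if none are given.
extract_species() {
    # -n suppresses sed's default output; the trailing "p" prints only
    # lines where the substitution matched, so sequence lines are skipped.
    sed -n 's/^>.* OS=\(.*\) OX=.*$/\1/p' "$@"
}
```

For example, running extract_species my.fasta > species.txt would write one species name per line. Because the pattern captures everything between the two markers, it does not depend on the species name being exactly two fields long.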