Search code examples
bashawkfasta

AWK script for search in a fasta file


I have such a fasta file :

>gnl|SRA|SRR035294.8571.2 FIHSSUW01ASCWS.2 length=224
GAGATGAAATAGATCTTGGCATATATGTACATGCTTGATCTCAGTTTTGATTGGATTTTATCCATTTTAG
CTATCTTAACTATTAATCTTGAAATGAAGCTTTAATTTATGTAGGAAGTTTATGAAATTTAGGAAAAAAA
AAGAAAAAAACAAAACAATGTCGGCCGCCTCGGTCTCTACTGAGACACGCAACAGGGGATAGGCAAGGCA
CACAGGGGATAGGN
>gnl|SRA|SRR035294.8572.2 FIHSSUW01ETZME.2 length=254
ACTAACCAGGTGGTAAACAACTACTACAGGCCAGATTTGAAGAAGGCTGCTCTTGCTAGATTGAGTGCAG
TGAACAGAAGCCTTAAGGTTTCAAAGTCTGGTGTGAAGAAGAAGAACAGACAGGCAGTTAGGATCCATGG
TAGGAAGTGAAGCTGTGATTTGCCTACCGTCTGATATTCATCGTATCACTTTCTAGCTGTTCCGTCTTGT
TTGGCAAGTGTTTGGTTTTACGTGCGAGTAGTTATATGTTGCGC
>gnl|SRA|SRR035294.8573.2 FIHSSUW01AZA99.2 length=230
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGGATGTACCAATTCAAAAAGAAAACAGCAGTT
GGGGGCAAAACAATTAAGTTGTAACGAATGCATATATATGATTAATCTTCTAACACATTATTTTTGTCTC
AAAAAAAAAGAAAAAAAACAAAACATGTCGGCCGCCTCGGTCTCTACTGAGACACGCAACAGGGGATAGG
CAAGGCACACAGGGGATAGG
>gnl|SRA|SRR035294.8574.2 FIHSSUW01EHI3P.2 length=153
TGCAAGTTTACAACTTAAAACAACTTTTCTCACAGTGAACAATAAATTTATCAATTCTCATGCAAAAAAA
AAGAAAAAAACAAAAACATGTCGGCCGCCTCGGTCTCTACTGAGACACGCAACAGGGGATAGGCAAGGCA
CACAGGGGATAGG
>gnl|SRA|SRR035294.8575.2 FIHSSUW01EWK4S.2 length=287
AACAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGAGATTACAGGTATTGCAAGTTTCAAGCCTGTC
ATAAAGACTCAAAGCCGCTTGTAATTTGTGTTTCCTAGTTGGGGAAGCTGTTTGTTCTTTATTGTGCTAT
ATGTATTTATTTGAAAGTTTGGATGAACTCAATAAATAAAAGAAAATCTTCATTGTGGGTTACAATTTGG
ACATGAACATGCATGAATAATGTACCAATTTAGCAAAAAAAAAGAAAAAAACAAAAAACAAATAGTCGGC
CGGCCCG
>gnl|SRA|SRR035294.8576.2 FIHSSUW01C911A.2 length=265
TATTCTCAGGTACGAAATATGAGTTTGCTGATAAATTGATGGATTGGGAATCAGCCTGCATAATAAGATA
TTCCCAATTAACTTTGCCCGTTAGTTCTTTTAGCTTTTCCTTTAAAGGCACGAGTCTTTCAACCAAAACA
TTACAGCAAAGTCTAACTGCCTCACAGCTTGCTTCAGAAGTTGTACCCCCGGCCGTAATGGCCACTCTGC
GTTGATACCACTGCTTCTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGG

and i have wrote this script in bash

STRING=$1
FILE=$(pwd)"/"$2

if [ -z "$STRING" ] 
then 
    echo "Usage: fastaFind.sh <query> <fasta file>"
else
    echo ""
    awk  'BEGIN { RS = ">" } ; $0 ~ "'$STRING'" { print $0 }' "$FILE"
fi

I am running this command

 fastaFind.sh "gnl|SRA|SRR035294.8573.2 FIHSSUW01AZA99.2 length=230" file.fasta

but it returns me an error for unterminated string. What i want to achieve is to retrieve after the execution of the command the specific sequence of the query. e.g

>gnl|SRA|SRR035294.8573.2 FIHSSUW01AZA99.2 length=230
AAGCAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGGATGTACCAATTCAAAAAGAAAACAGCAGTT
GGGGGCAAAACAATTAAGTTGTAACGAATGCATATATATGATTAATCTTCTAACACATTATTTTTGTCTC
AAAAAAAAAGAAAAAAAACAAAACATGTCGGCCGCCTCGGTCTCTACTGAGACACGCAACAGGGGATAGG

Solution

  • Your awk command would better be:

    awk 'BEGIN{ ORS = ""; RS = ">"; FS="\n" } $1 == "pattern" { print ">" $0 }' file
    

    Or

    awk -v p="pattern" 'BEGIN {ORS = ""; RS = ">"; FS = "\n" } $1 == p { print ">" $0 }' file
    

    And your shell script be:

    #!/bin/bash
    
    STRING=$1
    FILE=$2
    
    if [[ -z $STRING ]]; then
        echo "Usage: fastaFind.sh <query> <fasta file>"
    else
        awk -v p="$STRING" 'BEGIN{ ORS=""; RS=">"; FS="\n" } $1 == p { print ">" $0 }' "$FILE"
    fi
    

    Example usage:

    bash temp.sh 'gnl|SRA|SRR035294.8575.2 FIHSSUW01EWK4S.2 length=287' temp.txt
    

    Output:

    >gnl|SRA|SRR035294.8575.2 FIHSSUW01EWK4S.2 length=287
    AACAGTGGTATCAACGCAGAGTGGCCATTACGGCCGGGAGATTACAGGTATTGCAAGTTTCAAGCCTGTC
    ATAAAGACTCAAAGCCGCTTGTAATTTGTGTTTCCTAGTTGGGGAAGCTGTTTGTTCTTTATTGTGCTAT
    ATGTATTTATTTGAAAGTTTGGATGAACTCAATAAATAAAAGAAAATCTTCATTGTGGGTTACAATTTGG
    ACATGAACATGCATGAATAATGTACCAATTTAGCAAAAAAAAAGAAAAAAACAAAAAACAAATAGTCGGC
    CGGCCCG