Search code examples
awkgawkfasta

Trim first N bases in multi fasta file with awk and print with max width format


Background

The multi fasta format contains several record of sequences, each record begins with a single-line description, followed by several lines of sequence (RNA, DNA, protein). The description line has greaterthan symbol in the beginning, following ">" is the identifier of the sequence, and the rest of the line contains the description of record (both are optional).

in fasta files is common for lines sequences be formatted to a max width

input example, with max width="70":

>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
MMTYIVFILSTIFVVSFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGY
TTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEA
MGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYT
TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSI
GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
MTYTLFLLSVILVMGFVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYT
TAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSI
GAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN

My effort

I wrote a gawk script, it does left trim in sequences (cut first N bases) and if the length of sequence is less than trim then not print it.

gawk -v left_trim=15 -v width=70 '
  BEGIN{RS=">"}
  NR==1{next};        #for blank record
  {
    split ($0, a, "\n"); 
    sequence=""; 
    for(i=2; i<=length(a); i++){
      sequence=sequence a[i];
    }; 
    if(length(sequence)>left_trim) {
      printf(">%s\n%s\n",a[1], substr(sequence, left_trim+1));
    }
  }' test.fasta

result

>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTSNKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVVVTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSGVEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVVTGWTLLVGVYIVIEIARGN

expected

>gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
VTGWSLLIGVLVIMEVTRGN
>gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN
>gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
TGWTLLVGVYIVIEIARGN

Questions

  • how may i implement width format? i try with ">%s\n%.70s\n", print first 70 letters, ">%s\n%70s\n" print all

  • how may i improve this join? exits a function?

    sequence=""; 
    for(i=2; i<=length(a); i++){
      sequence=sequence a[i];
    };
    

Solution

  • To answer your specific questions, you can specify the width of an output field using the * format modifier:

    $ awk 'BEGIN{printf "%s\n", "foo"}'
    foo
    $ awk 'BEGIN{printf "%*s\n", 10, "foo"}'
           foo
    

    and no, there is no join function to put arrays back together into a string (the opposite of split()) BUT if you just let awk split the record into fields instead of you manually splitting the record into an array of elements, then you can just assign a new value to any field and awk will recompile the fields into $0 as I'm doing as a deliberate side-effect in my first solution below when I delete the first field/line with $1 = "".

    Here's how I'd approach the task anyway:

    $ cat tst.awk
    BEGIN { RS=">"; FS="\n"; OFS="" }
    NR>1 {
        print RS $1
        $1 = ""
        for ( start=left_trim+1; start<=length(); start+=width ) {
            print substr($0,start,width)
        }
    }
    
    $ awk -v left_trim=15 -v width=70 -f tst.awk file
    >gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
    SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
    NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
    VTGWSLLIGVLVIMEVTRGN
    >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
    FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
    VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
    TGWTLLVGVYIVIEIARGN
    >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
    FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
    VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
    TGWTLLVGVYIVIEIARGN
    

    or if you prefer:

    $ cat tst.awk
    /^>/ { prtRec(); rec=""; print; next }
    { rec = rec $0 }
    END { prtRec() }
    function prtRec(        start) {
        for ( start=left_trim+1; start<=length(rec); start+=width ) {
            print substr(rec,start,width)
        }
    }
    
    $ awk -v left_trim=15 -v width=70 -f tst.awk file
    >gi|304322925|ref|YP_003856771.1| NADH [Lynx rufus]
    SFVGFSSKPSPIYGGFGLIVAGGIGCGIVLNFGGSFLGLMVFLIYLGGMLVVFGYTTAMATEPYPEAWTS
    NKAVLGMLITGILAELLTACYILKEDEIEVVFKFNGAGDWVIYDTGDSGFFSEEAMGIAALYSYGTWLVV
    VTGWSLLIGVLVIMEVTRGN
    >gi|295065592|ref|YP_003587393.1| NADH [Nomascus siki]
    FVGFSSKPSPIYGGLVLVVSGVVGCAVILNCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
    VEVLVGVLVGLAMEVGLVLWAKECDGLVMVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
    TGWTLLVGVYIVIEIARGN
    >gi|295065550|ref|YP_003587316.1| NADH (mitochondrion) [Symphalangus syndactylus]
    FVGFSSKPSPIYGGLVLVVSGVVGCAIILDCGGGYLGLMVFLIYLGGMMVVFGYTTAMAIEEYPEAWGSG
    VEVLVGVLVGLAMEVGLVLWAKEYDGLVVVLNFNNMGSWVIYEGEGSGLIREDSIGAGALYDYGRWLVVV
    TGWTLLVGVYIVIEIARGN