
Move newline characters 5 positions downstream in a text (FASTA) file


I am trying to transform a text file like this (fasta format):

>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG

The objective is to move each newline character 5 positions downstream, so that sequence lines grow from 19 to 24 characters, while leaving the header lines starting with > untouched:

>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG

I would like to use AWK, but I am not sure how to proceed. I am thinking about something similar to this:

awk '{for(i=1;i<=NR;i++){ if($1 ~ /^>/){¿?¿?¿?}}}'

Do you know how I can solve this?


Solution

  • Assumptions:

    • the sequence lines of each record are re-wrapped to at most 24 characters (19 + 5); only the last line of a record may be shorter
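The 24 used here is the original line length plus the requested shift (19 + 5). If the input width is not known in advance, it could be derived from the file itself. A minimal sketch, assuming the first sequence line is a full-width line; the helper name fasta_width and the variable extra are hypothetical, not part of the original answer:

```shell
# Hypothetical helper: print (length of the first sequence line) + extra.
# Assumption: the first non-header, non-blank line has the full line width.
fasta_width() {
    awk -v extra="$1" '!/^>/ && NF { print length($1) + extra; exit }'
}

# Example: feed two 19-character sequence lines and ask for a shift of 5.
printf '>seq1\nAAAAAAAAAAAAAAAAAAA\nATGATGATGGAATGAGGAT\n' | fasta_width 5
```

The printed number could then be passed to awk as -v width=... instead of a hard-coded 24.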

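    One awk idea: append each record's sequence lines to a single string, then print that string back in 24-character chunks whenever the next header line (or the end of the file) is reached: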

    awk -v width=24 '                               # pass width in as awk variable "width"
    function print_sequence() {
        while (length(sequence) > 0) {              # while sequence is not empty
              print substr(sequence,1,width)        # print first "width" characters
              sequence=substr(sequence,width+1)     # remove first "width" characters
        }
    }
    
    /^>/ { print_sequence()                         # flush previous set of data to stdout
           print                                    # print current input line
           next                                     # process next input line
         }
         { sequence=sequence $1 }                   # append data to our "sequence" variable
    
    END  { print_sequence() }                       # flush last set of data to stdout
    ' fasta.in > fasta.out
    

    This generates:

    $ cat fasta.out
    >seq1
    AAAAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAAAAAAAAAAAAAAAA
    AAAAAAAAAATGATGATGGAATGA
    GGATTTAGGAGGGAGGAAAATTC
    >seq2
    CCCTCCGGGAAAAAAGAGGTTGCA
    ATGCGCGTATTTATTTTTTTTTTT
    TTTTTTTTTAAAAAAAAAAAAAGG
    CTGTAAAAAAAAAAAAAAAGGGG
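
One way to sanity-check a re-wrapped file is to confirm that only the line breaks moved, i.e. that each record's sequence is identical once all line breaks are removed. A minimal sketch under that assumption; the helper names rewrap_fasta and unwrap_fasta are hypothetical, and rewrap_fasta is just a compact restatement of the approach above that reads from stdin:

```shell
# Hypothetical helper: re-wrap sequence lines to "width" characters per line.
rewrap_fasta() {
    awk -v width="$1" '
        function emit() {                       # print buffered sequence in chunks
            while (length(buf) > 0) {
                print substr(buf, 1, width)
                buf = substr(buf, width + 1)
            }
        }
        /^>/ { emit(); print; next }            # header: flush previous record
             { buf = buf $1 }                   # sequence line: append to buffer
        END  { emit() }                         # flush the last record
    '
}

# Hypothetical helper: join each record's sequence into one long line.
unwrap_fasta() {
    awk '
        /^>/ { if (buf != "") print buf; buf = ""; print; next }
             { buf = buf $1 }
        END  { if (buf != "") print buf }
    '
}

sample='>seq1
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
>seq2
CCCTCCGGGAAAAAAGAGG'

before=$(printf '%s\n' "$sample" | unwrap_fasta)
after=$(printf '%s\n' "$sample" | rewrap_fasta 24 | unwrap_fasta)

if [ "$before" = "$after" ]; then
    echo "sequences unchanged"
else
    echo "sequences differ"
fi
```

Comparing the unwrapped forms rather than the files themselves means the check passes for any width, as long as no characters were lost or reordered.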