I am trying to transform a text file like this (fasta format):
>seq1
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAA
ATGATGATGGAATGAGGAT
TTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGG
TTGCAATGCGCGTATTTAT
TTTTTTTTTTTTTTTTTTT
AAAAAAAAAAAAAGGCTGT
AAAAAAAAAAAAAAAGGGG
The objective is to move each newline character 5 positions downstream (i.e. re-wrap the sequence lines from 19 to 24 characters), except for lines starting with >. The expected output is:
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
I would like to use AWK, but I am not sure how to proceed. I am thinking about something similar to this:
awk '{for(i=1;i<=NR;i++){ if($1 ~ /^>/){¿?¿?¿?}}}'
Do you know how I can solve this?
Assumptions:
One awk idea:
awk -v width=24 '                             # pass width in as awk variable "width"
function print_sequence() {
    while (sequence) {                        # while sequence is not blank
        print substr(sequence,1,width)        # print the first "width" characters
        sequence=substr(sequence,width+1)     # remove the first "width" characters
    }
}
/^>/ { print_sequence()                       # flush previous set of data to stdout
       print                                  # print current input line
       next                                   # process next input line
     }
     { sequence=sequence $1 }                 # append data to our "sequence" variable
END  { print_sequence() }                     # flush last set of data to stdout
' fasta.in > fasta.out
This generates:
$ cat fasta.out
>seq1
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAATGATGATGGAATGA
GGATTTAGGAGGGAGGAAAATTC
>seq2
CCCTCCGGGAAAAAAGAGGTTGCA
ATGCGCGTATTTATTTTTTTTTTT
TTTTTTTTTAAAAAAAAAAAAAGG
CTGTAAAAAAAAAAAAAAAGGGG
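If the re-wrapping is needed for more than one file or width, the same idea could be wrapped in a small shell function. This is only a sketch: the names rewrap_fasta, flush_seq and seq are made up for illustration, the width defaults to 24 when no second argument is given, and it assumes the width is a positive integer and that sequence lines contain no embedded whitespace.

```shell
# Hypothetical helper (name is an assumption, not part of the answer above):
# re-wrap the sequence lines of a FASTA file to a fixed width.
#   $1 = input file, $2 = line width (optional, defaults to 24; must be >= 1)
rewrap_fasta() {
    awk -v width="${2:-24}" '
        function flush_seq() {                 # print the buffered sequence in width-sized chunks
            while (length(seq) > 0) {
                print substr(seq, 1, width)
                seq = substr(seq, width + 1)
            }
        }
        /^>/ { flush_seq(); print; next }      # header line: flush previous record, print header unchanged
             { seq = seq $1 }                  # sequence line: append to the buffer
        END  { flush_seq() }                   # flush the last record
    ' "$1"
}

# Example usage:
#   rewrap_fasta fasta.in > fasta.out        # wrap at 24 characters
#   rewrap_fasta fasta.in 60 > fasta.out     # wrap at 60 characters
```

The loop tests length(seq) rather than the bare variable so the intent (stop when the buffer is empty) is explicit; otherwise the logic is the same buffer-and-flush approach as above.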