I would really appreciate your help.
I want to split a fasta file into strings of 7 characters separated by newlines and save the output in a new file.
input file:
$ cat file.fasta
name_of_the_protein
MRPPQCLLHTPSLASPLLLLLLWLLGGGVGAEGREDAELLVTVRGGRLRGIRLKTPGGPVSAFLGIPFAE
PPMGPRRFLPPEPKQPWSGVVDATTFQSVCYQYVDTLYPGFEGTEMWNPNRELSEDCLYLNVWTPYPRPT
expected output:
$ cat new_file.txt
MRPPQCL
RPPQCLL
PPQCLLH
PQCLLHT
QCLLHTP
CLLHTPS
Now that you have clarified the original request: you can run awk
from the Linux or Cygwin command line to get the output you want:
awk 'BEGIN { targetLength=7 } (NR>1) { for(i=1; i < length($0)-targetLength+2; i++) { print(substr($0,i,targetLength)) } }' file.fasta
Alternatively, put the following content into a separate file (sequencer.awk):
# Length of each output string
BEGIN { targetLength=7 }
# Skip the name on line 1; slide a window over every following line
(NR>1) {
    for(i=1; i < length($0)-targetLength+2; i++) {
        print(substr($0,i,targetLength))
    }
}
and run it with awk -f sequencer.awk file.fasta (append > new_file.txt to save the output in a new file).
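This assumes your file has only two lines: one with the name and one with the whole sequence (beware that some text viewers wrap long lines on screen even though the file itself has no line break there). If the sequence is spread over several lines, each line is processed on its own and the results are printed one after another, so any 7-character string that spans a line break is missing from the output.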
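If the sequence in your file really is wrapped over several lines, one possible workaround is to join the sequence lines first and only then slide the window. The following is a minimal sketch, not a tested solution for every FASTA variant: the function name sliding_windows is made up for illustration, it assumes the first line is the name and every other line belongs to the same single sequence, and it also strips a trailing carriage return in case the file came from Windows.

```shell
# Hypothetical helper: sliding_windows FILE [LENGTH]
# Prints every overlapping LENGTH-character string (default 7) of the sequence.
# Assumption: line 1 is the name, all remaining lines form ONE sequence.
sliding_windows() {
    awk -v targetLength="${2:-7}" '
        { sub(/\r$/, "") }          # drop a Windows carriage return, if any
        NR > 1 { seq = seq $0 }     # skip the name line, join the sequence lines
        END {
            # same window loop as above, but over the joined sequence
            for (i = 1; i <= length(seq) - targetLength + 1; i++)
                print substr(seq, i, targetLength)
        }
    ' "$1"
}
```

After defining the function in your shell, you could call it as sliding_windows file.fasta > new_file.txt. Because the whole sequence is held in one variable, this is meant for single-record files; a file with several named sequences would need the joined sequence to be reset at each name line.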