
How can I split a string into a constant number of characters, moving 1 character at a time?


I would really appreciate your help.

I want to split the sequence in a FASTA file into overlapping strings of 7 characters, each shifted by 1 character and written on its own line, and save the output in a new file.

input file:

$ cat file.fasta  

name_of_the_protein  
MRPPQCLLHTPSLASPLLLLLLWLLGGGVGAEGREDAELLVTVRGGRLRGIRLKTPGGPVSAFLGIPFAE  
PPMGPRRFLPPEPKQPWSGVVDATTFQSVCYQYVDTLYPGFEGTEMWNPNRELSEDCLYLNVWTPYPRPT  

expected output:

$ cat new_file.txt  

MRPPQCL  
RPPQCLL  
PPQCLLH  
PQCLLHT  
QCLLHTP  
CLLHTPS

Solution

  • Since you clarified the original request, you can run awk from the Linux or Cygwin command line to get the output you want:

    awk 'BEGIN { targetLength=7 } (NR>1) { for(i=1; i < length($0)-targetLength+2; i++) { print(substr($0,i,targetLength)) } }' file.fasta
    

    Alternatively, put the following content into a separate file (sequencer.awk):

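    # Window size in characters
    BEGIN { targetLength=7 }
    # Skip the first line (the name); slide the window over every other line
    (NR>1) {
       # The last window starts at position length($0)-targetLength+1
       for(i=1; i < length($0)-targetLength+2; i++) {
         print(substr($0,i,targetLength))
       }
    }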
    

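    and run it with awk -f sequencer.awk file.fasta. To save the result in a new file, redirect the output: awk -f sequencer.awk file.fasta > new_file.txt.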

    This assumes the file has only two lines: one with the name and one with the whole sequence (beware that some text viewers introduce line wrapping). If the sequence spans more than one line, each line is processed independently: the windows for each line are printed one after another, and no window spans a line break.
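
If the sequence is wrapped over several lines, as in the sample input, one option is to join the lines before sliding the window. The following is only a minimal sketch under some assumptions: the file holds a single record, its first line is the name, and every later line is part of the sequence. The function name sliding_windows and its optional second argument (the window length, defaulting to 7) are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: print every window of the joined sequence.
# Assumption: one record only; line 1 is the name, all later lines are sequence.
# $1 = input file, $2 = window length (defaults to 7).
sliding_windows() {
    awk -v targetLength="${2:-7}" '
        NR > 1 {
            sub(/[ \t\r]+$/, "")   # drop trailing blanks or a carriage return
            seq = seq $0           # join wrapped sequence lines into one string
        }
        END {
            # The last window starts at position length(seq)-targetLength+1
            for (i = 1; i <= length(seq) - targetLength + 1; i++)
                print substr(seq, i, targetLength)
        }
    ' "$1"
}
```

It could be called as sliding_windows file.fasta > new_file.txt. Because the lines are joined first, windows that straddle a line break are included as well. A file with several records would need extra handling that is not shown here.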