python python-2.7 python-3.x biopython fasta

Python: How to print out sequences with length n from sliding window in FASTA file?


I have a FASTA file with a few sequences. I would like to slide a window of size 5 along each sequence and extract the subsequence at every position the window passes over.

For example (test1.fasta):
>human1
ATCGCGTC
>human2
ATTTTCGCGA

Expected output (test1_out.txt):
>human1
ATCGC
>human1
TCGCG
>human1
CGCGT
>human1
GCGTC
>human2
ATTTT
>human2
TTTTC
>human2
TTTCG
>human2
TTCGC
>human2
TCGCG
>human2
CGCGA

My code below only extracts the first five bases of each sequence. How can I shift the window one base at a time (step size 1, window size 5) so that every 5 bp window is extracted?

from Bio import SeqIO

with open("test1_out.txt", "w") as f:
    for seq_record in SeqIO.parse("test1.fasta", "fasta"):
        f.write(str(seq_record.id) + "\n")
        f.write(str(seq_record.seq[:5]) + "\n")  # first 5 base positions

I took the code above from another example on Stack Overflow.


Solution

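  • seq_record.seq is the whole DNA sequence of a record, e.g. "ATCGCGTC" for human1. A sequence of length n contains n - 4 windows of size 5, so loop over every start position and slice out 5 bases each time: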

    from Bio import SeqIO
    
    with open("test1_out.txt", "w") as f:
        for seq_record in SeqIO.parse("test1.fasta", "fasta"):
            # A sequence of length n has n - 4 windows of size 5.
            for i in range(len(seq_record.seq) - 4):
                f.write(">" + str(seq_record.id) + "\n")  # seq_record.id has no leading ">"
                f.write(str(seq_record.seq[i:i+5]) + "\n")  # 5-base window starting at i
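
If the window size or step might change later, the same loop can be written with both as parameters. The following is only a sketch: the helper names `sliding_windows` and `windows_as_fasta` are made up for illustration and are not part of Biopython, and the two records are typed in by hand so that it runs without Biopython installed.

```python
def sliding_windows(sequence, size=5, step=1):
    """Yield each full-length window of `size` characters, advancing `step` at a time."""
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    # The last valid start index is len(sequence) - size, so a sequence
    # shorter than `size` yields no windows at all.
    for start in range(0, len(sequence) - size + 1, step):
        yield sequence[start:start + size]


def windows_as_fasta(records, size=5, step=1):
    """Format (record_id, sequence) pairs as FASTA text, one entry per window."""
    lines = []
    for record_id, sequence in records:
        for window in sliding_windows(sequence, size, step):
            lines.append(">" + record_id)  # header line repeated for every window
            lines.append(window)
    return "".join(line + "\n" for line in lines)


# The two records from test1.fasta, written out by hand (assumption: the real
# script would read them from the file instead).
records = [("human1", "ATCGCGTC"), ("human2", "ATTTTCGCGA")]
print(windows_as_fasta(records), end="")
```

With Biopython, the hand-written list could presumably be replaced by `((rec.id, str(rec.seq)) for rec in SeqIO.parse("test1.fasta", "fasta"))`, and the returned text written to test1_out.txt.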