Search code examples
pythonbiopython

Biopython issue in finding motifs in strings and deleting target sequence


Hi I have a large FASTA file which looks like this

>EMBOSS_001
GTCATCACAGTTTTCCCCGCCCTGTATATGGCTAATAGGCCCTCGCAATCTCCGATAAAT
>EMBOSS_002
CTGATGCTAGTCCCGTGTCCCAAACACTTCCGCAGAAGATCGCCCCGGGGGGCGTGTACC
>EMBOSS_003
CGCGCATGGACTCCATCCGTGATCTTTTGAGGCCATGAGTCCAAGTTTACCTCGGATATA
>EMBOSS_004
CGACCCGCCATTCTCCATCGTAACTTAGTCACGACGACAGTCAGCTTGTTCGTTCGTTAT

I would like to find all the sequences which have a specific motif and eliminate them For example if the motif is TTTCCC, the expected output should be:

>EMBOSS_002 CTGATGCTAGTCCCGTGTCCCAAACACTTCCGCAGAAGATCGCCCCGGGGGGCGTGTACC
>EMBOSS_003 CGCGCATGGACTCCATCCGTGATCTTTTGAGGCCATGAGTCCAAGTTTACCTCGGATATA
>EMBOSS_004 CGACCCGCCATTCTCCATCGTAACTTAGTCACGACGACAGTCAGCTTGTTCGTTCGTTAT

I wrote a code with Biopython:

from Bio.Seq import Seq
import Bio.motifs as motifs
from Bio import SeqIO

instances = [Seq("TTTCCC")]
m = motifs.create(instances)

reads = list(SeqIO.parse("/Users/EMBOSS-6.6.0/emboss/genome.fa", "fasta"))

for i in range(len(reads)):
    for pos, seq in m.instances.search(reads[i].seq):
        print("%i %s" % (pos, seq))

However it returns me only the info of the position of the start of the motifs, 11 TTTCCC I would like to return the info of also the sequence where it was found: EMBOSS_001 11 TTTCCC In addition, I would like the code to eliminate that sequence where the motif was found.

In addition, I am not able to delete the string where the motif was found and write it to output

for i in range(len(reads)):
    for pos, seq in m.instances.search(reads[i].seq):
        print(" %s %i %s" % (reads[i - 0][1:], pos, seq))
        del reads[i - 0:i]
        SeqIO.write(reads, "/Users/EMBOSS-6.6.0/emboss/results6.fa", "fasta")

Solution

  • I can't test this locally but I will explain with some code on how you can go about solving this problem.

    If the file structure is as you mentioned then one of the points you can derive is that where instances are checked they happen on an even line (say n) and what sequence they represent is simply n-1. So in order for you to output in this format EMBOSS_001 11 TTTCCC - the simplest of way is to use the index i as counter and determine the sequence.

    For example: reads[i-1] would give you the sequence >EMBOSS_001 for the FASTA GTCATCACAGTTTTCCCCGCCCTGTATATGGCTAATAGGCCCTCGCAATCTCCGATAAAT . To remove the >, set it to reads[i-1][1:].

    To eliminate the seq when motif TTTCCC is found, there are many ways to do this. The simplest would be to use the del method for python list-objects. This will simply remove the sequence and the element in which the motif occurs.

    This can be done easily and this is how the change would look in your code

    for pos, seq in m.instances.search(reads[i].seq):
            print(" %s %i %s" % (reads[i-1][1:],pos, seq))#should print in format EMBOSS_001 11 TTTCCC
            del reads[i-1:i]
    

    This should hopefully solve it. Let me know if you encounter any errors.

    Edit: My initial write intention should have looked like this - have also added a break statement to see if it solves the problem.

    for i in range(len(reads)):
        for pos, seq in m.instances.search(reads[i].seq):
            print(" %s %i %s" % (reads[i - 0][1:], pos, seq))
            del reads[i - 0:i]
            break 
    SeqIO.write(reads, "/Users/EMBOSS-6.6.0/emboss/results6.fa", "fasta")
    

    Where once the motifs were found it would eliminate that sequence. This means that only the strings without the motifs should now be written to SeqIO.write() method.