python sequence bioinformatics biopython fasta

Python: How to remove sequences based on their bases rather than their header names?


I would like to subtract one FASTA file from another based on the sequence content rather than the header names. Is there a way to do this? Can anyone help? If the FASTA headers below are all replaced with >human, the following code no longer works.

Code

from Bio import SeqIO

input_file = 'a.fasta'
merge_file = 'original.fasta'
output_file = 'results.fasta'

# Collect the IDs (header names) of the records to exclude
exclude = set()
fasta_sequences = SeqIO.parse(input_file, 'fasta')
for fasta in fasta_sequences:
    exclude.add(fasta.id)

# Write every record whose ID is not in the exclude set
fasta_sequences = SeqIO.parse(merge_file, 'fasta')
with open(output_file, 'w') as output_handle:
    for fasta in fasta_sequences:
        if fasta.id not in exclude:
            SeqIO.write([fasta], output_handle, "fasta")

a.fasta

>chr12:15747942-15747949
TGACATCA
>chr2:130918058-130918065
TGACCTCA

original.fasta

>chr3:99679938-99679945
TGACGTAA
>chr9:135822160-135822167
TGACCTCA
>chr12:15747942-15747949
TGACATCA
>chr2:130918058-130918065
TGACCTCA
>chr2:38430457-38430464
TGACCTCA
>chr1:112381724-112381731
TGACATCA

results.fasta

>chr3:99679938-99679945
TGACGTAA
>chr9:135822160-135822167
TGACCTCA
>chr2:38430457-38430464
TGACCTCA
>chr1:112381724-112381731
TGACATCA

Solution

  • You can compare the sequences themselves instead of the IDs. Be careful, though: this method only removes sequences that match exactly, so a sequence that differs by even one base will be kept. Access the sequence as a string with str(your_obj.seq).

    In your code, make the change here:

    for fasta in fasta_sequences:
        exclude.add(str(fasta.seq))

    and here:

    for fasta in fasta_sequences:
        if str(fasta.seq) not in exclude:

    In your example, note that results.fasta will contain only the following record, because it is the only sequence in original.fasta that does not match a sequence from a.fasta.

    >chr3:99679938-99679945
    TGACGTAA
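
Putting the two changes together, the sequence-based filter could be sketched roughly as below. The function names (sequence_key, filter_by_sequence, filter_fasta) are made up for illustration, and the uppercase normalization is an assumption added here so that 'tgac' and 'TGAC' count as the same sequence; drop it if case matters in your data. The small demo at the end uses stand-in objects with the records from the question instead of reading files.

```python
from types import SimpleNamespace  # only needed for the in-memory demo below


def sequence_key(record):
    """Return the record's bases as an uppercase string (assumed normalization)."""
    return str(record.seq).upper()


def filter_by_sequence(records, exclude_records):
    """Yield the records whose bases do not occur in exclude_records."""
    exclude = {sequence_key(rec) for rec in exclude_records}
    for rec in records:
        if sequence_key(rec) not in exclude:
            yield rec


def filter_fasta(input_file, merge_file, output_file):
    """Write the records of merge_file whose bases are absent from input_file."""
    from Bio import SeqIO  # local import: the helpers above do not need Biopython

    kept = filter_by_sequence(
        SeqIO.parse(merge_file, "fasta"),
        SeqIO.parse(input_file, "fasta"),
    )
    # SeqIO.write accepts an iterable of records and returns how many it wrote
    return SeqIO.write(kept, output_file, "fasta")


# Demo with the records from the question, as simple stand-in objects
a_records = [
    SimpleNamespace(id="chr12:15747942-15747949", seq="TGACATCA"),
    SimpleNamespace(id="chr2:130918058-130918065", seq="TGACCTCA"),
]
original_records = [
    SimpleNamespace(id="chr3:99679938-99679945", seq="TGACGTAA"),
    SimpleNamespace(id="chr9:135822160-135822167", seq="TGACCTCA"),
    SimpleNamespace(id="chr12:15747942-15747949", seq="TGACATCA"),
    SimpleNamespace(id="chr2:130918058-130918065", seq="TGACCTCA"),
    SimpleNamespace(id="chr2:38430457-38430464", seq="TGACCTCA"),
    SimpleNamespace(id="chr1:112381724-112381731", seq="TGACATCA"),
]
kept_ids = [rec.id for rec in filter_by_sequence(original_records, a_records)]
print(kept_ids)
```

With real files, the call would be filter_fasta('a.fasta', 'original.fasta', 'results.fasta'). Because the comparison uses only the bases, it keeps working even if every header is renamed to >human.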