Search code examples
pythonbiopythonfasta

Filtering a FASTA file based on restriction-sequence with BioPython


I have a fasta file. From that file, I need to get the only sequences containing 'CCNNNGG' (where 'N' represents random nucleotides) and put them in a new fasta file.

Example (it should output the first sequence):

m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/3312_3597 CGCGGCATCGAATTAATACGACTCACTATAGGTTTTTTTATTG*********CCTACGG***********GTATTTTCAGTTAGATTCTTTCTTCTTAGAGGGTACAGAGAAAGGGAGAAAATAGCTACAGACATGGGAGTGAAAGGTAGGAAGAAGAGCGAAGCAGACATTATTCA

m121012_054644_42133_c100390582550000001523038311021245_s1_p0/7/3708_4657 CAACGGTTTTGCCACAAGATCAGGAACATAAGTCACCAGACTCAATTCATCCCCATAAGACCTCGGACCTCTCAATCCTCGAATTAGGATGTTCTCGTACGGTCTATCAGTATATAAACCTGACATACTATAAAAAAGTATACCAT TCTTATCATGTACAGTAGGGTACAGTAGG

(*s added for highlighting)

And my code:

 from Bio import SeqIO 


    my_sequences = []
    for record in SeqIO.parse(open("example.fa", "rU"), "fasta") :
         if "CCTACGG" in record.seq :                #Works fine with CCTACGG

            my_sequences.append(record)



    output_handle = open("my_seqs.fasta", "w")
    SeqIO.write(my_sequences, output_handle, "fasta")
    output_handle.close()

My problem is that I don't know how to write random nucleotides, instead of write "CCTACGG" after if I want to put 'CCNNNGG', where N represents random nucleotides ('C' or 'T' or 'G' or 'A').


Solution

  • You can use regular expressions to do this, via Python's re module:

    import re
    
    pattern = 'CCNNNGG'
    regex = re.compile(pattern.replace('N', '[ACGT]'))
    for record in SeqIO.parse(...):
        if re.search(regex, record.seq) is not None:
            my_sequences.append(record)
    

    This replaces every 'N' in your pattern with '[ACGT]', which will match any one of those four characters, then searches for that pattern in each record.seq.

    Also, note that your examples aren't very good - the second one also matches that pattern (it contains 'CCCATGG') - see the results!