Tags: python, bioinformatics, biopython, fuzzy-search

Find a pattern match in a string in Python


I am trying to find an amino acid pattern (B-C or M-D, where '-' can be any letter other than 'P') in a protein sequence, say 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'. The protein sequence is in a FASTA file.

I have tried many approaches but couldn't find a solution. The following code is one of my attempts:

import Bio
from Bio import SeqIO

seqs = SeqIO.parse(X, 'fasta')  # read the sequences from the FASTA file X
for aa in seqs:
    x = aa.seq  # the record's sequence (.seq is an attribute of a Biopython SeqRecord; str(aa.seq) gives a plain string)
    
    for val, i in enumerate(x):          
        
        if i=='B':    
            if (x[val+2])=='C':
                
                if x[val+1]!='P':
                   pattern=((x[val]:x[val+2])) ## trying to print full sequence B-C
                

Unfortunately, none of my attempts work. It would be great if someone could help me with this problem.


Solution

  • Use a regular expression with a negated character class: [^P] matches any single character except 'P'.

    import re
    
    string = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'
    re.findall(r"B[^P]C|M[^P]D", string)
    

    Output:

    ['BAC', 'MLD']
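
To apply the same pattern to every record in a FASTA file and also report where each match starts, something along these lines may help. This is only a sketch: `find_motifs` and the `Record` namedtuple are hypothetical names introduced here, and `Record` merely stands in for the objects that `SeqIO.parse` yields (anything with `.id` and `.seq` attributes).

```python
import re
from collections import namedtuple

# Same pattern as above: B, any one character except P, C;
# or M, any one character except P, D.
MOTIF = re.compile(r"B[^P]C|M[^P]D")


def find_motifs(records):
    """Yield (record id, 0-based start index, matched text) for every hit."""
    for record in records:
        # str() turns a Biopython Seq into a plain string; it leaves a str unchanged.
        for match in MOTIF.finditer(str(record.seq)):
            yield record.id, match.start(), match.group()


# Hypothetical stand-in for the records that SeqIO.parse(path, "fasta") would yield.
Record = namedtuple("Record", ["id", "seq"])
records = [Record("example", "VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV")]

hits = list(find_motifs(records))
for hit in hits:
    print(hit)
```

With Biopython installed, passing `SeqIO.parse(X, 'fasta')` in place of `records` should work the same way. Note that `findall` and `finditer` return non-overlapping matches only; if two motifs can share residues (for example 'MBDCD' contains both 'MBD' and 'BDC'), wrapping the pattern in a lookahead, `(?=(B[^P]C|M[^P]D))`, and reading `match.group(1)` reports both.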