I am trying to find an amino acid pattern (B-C or M-D, where '-' can be any letter other than 'P') in a protein sequence, for example 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'. The protein sequence is in a FASTA file.
I have tried several approaches but couldn't find a solution. The following code is one of my attempts:
import Bio
from Bio import SeqIO
seqs = SeqIO.parse(X, 'fasta')  # read the sequences from the FASTA file
for aa in seqs:
    x = aa.seq  # the record's sequence (.seq is an attribute of a Biopython SeqRecord)
    for val, i in enumerate(x):
        if i == 'B':
            if x[val+2] == 'C':
                if x[val+1] != 'P':
                    pattern = ((x[val]:x[val+2]))  # trying to get the full B-C match
Unfortunately, none of them work. It would be great if someone could help me out with this problem.
Use a regular expression with a negated character class: "[^P]" matches any single character except 'P'.
import re
string = 'VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV'
re.findall(r"B[^P]C|M[^P]D", string)
Output:
['BAC', 'MLD']
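If you also need the position of each hit, or want to run the search over every record in a FASTA file, something along these lines might help. This is only a sketch: the names find_motifs, find_motifs_loop and scan_fasta are made up for illustration, and scan_fasta assumes Biopython is installed. The pattern is wrapped in a lookahead so that overlapping hits are reported too, which plain re.findall would not do. The second function is a corrected version of the index-based loop from the question, for comparison.

```python
import re

# Lookahead wrapper so that overlapping hits (e.g. 'BMCD' contains both
# 'BMC' and 'MCD') are all reported; a plain findall would skip the second.
MOTIF = re.compile(r"(?=(B[^P]C|M[^P]D))")


def find_motifs(seq):
    """Return (0-based start, matched triplet) pairs found by the regex."""
    return [(m.start(), m.group(1)) for m in MOTIF.finditer(seq)]


def find_motifs_loop(seq):
    """Same search with explicit indexing, as attempted in the question."""
    hits = []
    # Stop two characters before the end so seq[i + 2] always exists.
    for i in range(len(seq) - 2):
        first, middle, last = seq[i], seq[i + 1], seq[i + 2]
        if middle != "P" and (first, last) in (("B", "C"), ("M", "D")):
            hits.append((i, seq[i:i + 3]))  # slice with seq[i:i + 3]
    return hits


def scan_fasta(path):
    """Hypothetical helper: yield (record id, hits) for each FASTA record."""
    from Bio import SeqIO  # requires Biopython

    for record in SeqIO.parse(path, "fasta"):
        # Convert the Seq object to a plain string before matching.
        yield record.id, find_motifs(str(record.seq))


seq = "VATLDSCBACSKVNDNVKNKVKVKNVKMLDHHHV"
print(find_motifs(seq))  # prints [(7, 'BAC'), (27, 'MLD')]
```

The start positions are 0-based; add 1 if you want the usual 1-based residue numbering.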