I have a protein sequence:
`seq = "EIVLTQSPGTLSLSRASQS---VSSSYLAWYQQKPG"`
and I want to match two types of regions (substrings) in it:
The first type is contiguous, like `TQSPG` in `seq`.
For the second type, I only know the contiguous form, but in `seq` there may be one or more "-" characters in the middle. For example, what I know is `SQSVS`, but in `seq` it appears as `SQS---VS`.
What I want is to match both types of string and get their indices. For example, `TQSPG` is at `(4, 9)` and `SQSVS` is at `(16, 24)`.
I tried `re.search('TQSPG', seq).span()`, which returns `(4, 9)`, but I don't know how to handle the second type.
Assuming the order of the letters in `SQSVS` needs to be preserved, I'd propose the regex `r'S-*Q-*S-*V-*S'`. This will match the sequence `SQSVS` with any number (possibly zero) of hyphens between any two adjacent letters.
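If you have many such strings, writing the pattern by hand gets tedious. Here is a minimal sketch that builds the same kind of pattern from the known contiguous string. The helper name `gapped_span` and its `gap` parameter are my own for illustration, not part of the `re` module. It assumes you only want the first match.

```python
import re

def gapped_span(motif, seq, gap="-"):
    """Return the (start, end) span of the first match of motif in seq,
    allowing any number of gap characters between adjacent letters.
    Hypothetical helper for illustration; returns None if nothing matches."""
    # Escape each letter and the gap character so they are matched literally,
    # then join the letters with "<gap>*" (zero or more gap characters).
    pattern = (re.escape(gap) + "*").join(re.escape(ch) for ch in motif)
    match = re.search(pattern, seq)
    return match.span() if match else None

seq = "EIVLTQSPGTLSLSRASQS---VSSSYLAWYQQKPG"
print(gapped_span("TQSPG", seq))  # (4, 9)
print(gapped_span("SQSVS", seq))  # (16, 24)
```

Since `-*` also matches zero hyphens, the same function covers the first (contiguous) type too, so you don't need to treat the two cases separately.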