python, string, python-re

How to match a string in which "-" may appear multiple times with Python re?


I have a protein sequence:

`seq = "EIVLTQSPGTLSLSRASQS---VSSSYLAWYQQKPG"`

and I want to match two types of regions/strings:

The first type is continuous, like TQSPG in seq.

For the second type, I only know the continuous form, but in seq there may be one or more "-" characters in the middle. For example, what I know is SQSVS, but in seq it appears as SQS---VS.

What I want to do is match both types of string and get their indices, for example (4,9) for TQSPG and (16,24) for SQSVS.

I tried re.search('TQSPG',seq).span(), which returns (4,9), but I don't know how to deal with the second type.


Solution

  • Assuming the order of the letters in SQSVS needs to be preserved, I'd propose the regex r'S-*Q-*S-*V-*S'. This matches SQSVS with any number of hyphens (possibly zero) between any two adjacent letters, so the same kind of pattern also covers the continuous case.
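
Rather than writing the pattern by hand, it can be built from the known continuous form. The following is a minimal sketch, assuming gaps are always written as "-" and only the first occurrence is wanted; the helper name gapped_span is made up for illustration and is not part of the re module.

```python
import re

seq = "EIVLTQSPGTLSLSRASQS---VSSSYLAWYQQKPG"

def gapped_span(known, sequence):
    """Return the (start, end) span of `known` in `sequence`,
    allowing any number of '-' between adjacent letters, or None."""
    # "SQSVS" becomes the pattern "S-*Q-*S-*V-*S"; re.escape guards
    # against characters that are special in a regex.
    pattern = "-*".join(re.escape(ch) for ch in known)
    match = re.search(pattern, sequence)
    return match.span() if match else None

print(gapped_span("TQSPG", seq))  # (4, 9)
print(gapped_span("SQSVS", seq))  # (16, 24)
```

Because "-*" also matches zero hyphens, the same helper handles both the continuous and the gapped type. To get every occurrence instead of only the first, re.finditer can be used in place of re.search.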