I am trying to match if there's singe word followed by \s|||\s
and then another single word followed by \s|||\s
so I'm using this regex:
single_word_regex = r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*'
And when I tried to match this string, the regex matching hangs or take minutes (possibly going into some sort of "deep loop")
>>> import re
>>> import time
>>> single_word_regex = r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*'
>>> x = u'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
>>> z = u'amor 我 ||| amor . i ||| 0.594819 0.0585231 0.594819 0.0489472 ||| 0-0 0-1 1-2 ||| 2 2 2 ||| |||'
>>> y = u'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'
>>> re.match(single_word_regex, z, re.U)
>>> re.match(single_word_regex, y, re.U)
<_sre.SRE_Match object at 0x105b879c0>
>>> start = time.time(); re.match(single_word_regex, y, re.U); print time.time() - start
9.60826873779e-05
>>> start = time.time(); re.match(single_word_regex, x, re.U); print time.time() - start # It hangs...
Why is it taking that long?
Is there a better/simpler regex to capture this conditionlen(x.split(' ||| ')[0].split()) == 1 == len(x.split(' ||| ').split())
?
Note that by itself, a r'(\w+)+'
pattern will not cause the catastrophic backtracking, it will only be "evil" inside a longer expression and especially when it is placed next to the start of the pattern since in case subsequent subpatterns fail the engine backtracks to this one, and as the 1+ quantifier inside is again quantified with +
, that creates a huge amount of possible variations to try before failing. You may have a look at your regex demo and click the regex debugger on the left to see example regex engine behavior.
The current regex can be written as
r'^(\w+)\s\|{3}\s(\w+)\s\|{3}\s(.*)'
See the regex demo where there will be a match if you remove space and .
in the second field.
Details:
^
- start of string (not necessary with re.match
)(\w+)
- (Group 1) 1+ letters/digits/underscores\s
- a whitespace\|{3}
- 3 pipe symbols\s(\w+)\s\|{3}\s
- see above ((\w+)
creates Group 2)(.*)
- (Group 3) any 0+ characters other than linebreak characters.