Search code examples
pythonregexloopsfreeze

Regex match (\w+) to capture single words delimited by |||


I am trying to match if there's singe word followed by \s|||\s and then another single word followed by \s|||\s so I'm using this regex:

single_word_regex = r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*'

And when I tried to match this string, the regex matching hangs or take minutes (possibly going into some sort of "deep loop")

>>> import re
>>> import time
>>> single_word_regex = r'(\w+)+\s\|\|\|\s(\w+)\s\|\|\|\s.*'        
>>> x = u'amornratchatchawansawangwong ||| amornratchatchawansawangwong . ||| 0.594819 0.5 0.594819 0.25 ||| 0-0 0-1 ||| 1 1 1 ||| |||'
>>> z = u'amor 我 ||| amor . i ||| 0.594819 0.0585231 0.594819 0.0489472 ||| 0-0 0-1 1-2 ||| 2 2 2 ||| |||'
>>> y = u'amor ||| amor ||| 0.396546 0.0833347 0.29741 0.08 ||| 0-0 0-1 ||| 3 4 2 ||| |||'
>>> re.match(single_word_regex, z, re.U)                                              
>>> re.match(single_word_regex, y, re.U)                                          
<_sre.SRE_Match object at 0x105b879c0>
>>> start = time.time(); re.match(single_word_regex, y, re.U); print time.time() - start
9.60826873779e-05
>>> start = time.time(); re.match(single_word_regex, x, re.U); print time.time() - start # It hangs...

Why is it taking that long?

Is there a better/simpler regex to capture this conditionlen(x.split(' ||| ')[0].split()) == 1 == len(x.split(' ||| ').split())?


Solution

  • Note that by itself, a r'(\w+)+' pattern will not cause the catastrophic backtracking, it will only be "evil" inside a longer expression and especially when it is placed next to the start of the pattern since in case subsequent subpatterns fail the engine backtracks to this one, and as the 1+ quantifier inside is again quantified with +, that creates a huge amount of possible variations to try before failing. You may have a look at your regex demo and click the regex debugger on the left to see example regex engine behavior.

    The current regex can be written as

    r'^(\w+)\s\|{3}\s(\w+)\s\|{3}\s(.*)'
    

    See the regex demo where there will be a match if you remove space and . in the second field.

    Details:

    • ^ - start of string (not necessary with re.match)
    • (\w+) - (Group 1) 1+ letters/digits/underscores
    • \s - a whitespace
    • \|{3} - 3 pipe symbols
    • \s(\w+)\s\|{3}\s - see above ((\w+) creates Group 2)
    • (.*) - (Group 3) any 0+ characters other than linebreak characters.