
Is this a bug in Ruby's Regexp? How to guard against an "infinite loop" from a regex match without using Timeout?


I have this regex:

regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/i

When I use it on some (but not all) texts, e.g. this one:

text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"

like so: text.match(regex), Ruby appears to run in an infinite loop. Why? And is there any way to guard against this, e.g. by having Ruby raise an exception instead, without using Timeout? Timeout is known to be problematic when used with Sidekiq (https://github.com/mperham/sidekiq/wiki/Problems-and-Troubleshooting#add-timeouts-to-everything).

Ruby version: 2.7.2


Solution

  • Built-in character classes are mostly table-driven.
    Because of that, negated built-in classes such as \W, \S, etc.
    are difficult for engines to merge into a positive character class.

    In this case a bug looks likely because, as you said, the hang only happens on
    some target strings and not on others.

    In fact, [a-xzA-XZ\W] works on the sample string. The match hangs as soon as Y is included
    anywhere in the class, but only for that particular string.

    Let's see if we can determine if this is a bug or not.

    First, some tests:

    Test - Fail [a-zA-Z\W]

    https://rextester.com/FHUQG84843

    # Test - Fail  [a-zA-Z\W]
    # Warning: on Ruby 2.7.2 the match below is reported to hang, so "Done" is never printed.
    puts "Hello World!"  # start marker
    regex = /(Si.ges[a-zA-Z\W]*avec\W*fonction\W*m.moires)/ui
    text = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"
    res = text.match(regex)
    puts "Done"
    

    Test - Pass [a-xzA-XZ\W]

    https://rextester.com/RPV28606

    Test - Pass [a-zA-Z\P{Word}]

    https://rextester.com/DAMW9069


    Conclusion: Report this as a bug.
    IMO this is a bug in the built-in class \W, which is engine-defined,
    whereas \P{Word} is defined through a Unicode property function, not a range.
    And we see that [a-zA-Z\P{Word}] works just fine.
    Use \P{Word} inside character classes as a temporary workaround.
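
The workaround above can be sketched as follows. This is a minimal illustration, not a verified fix: the constant names and the short sample string are made up for the example, while the long sample is the text from the question. One caveat worth checking against your own data: in Ruby, \W is ASCII-based and therefore matches accented letters such as "é", whereas \P{Word} follows the Unicode definition and does not, so the two classes are not strictly equivalent.

```ruby
# Sketch of the \P{Word} workaround. Constant names and SHORT_SAMPLE are
# hypothetical; LONG_SAMPLE is the text from the question.
WORKAROUND_REGEX = /(Si.ges[a-zA-Z\P{Word}]*avec\W*fonction\W*m.moires)/i

SHORT_SAMPLE = "Sièges avant avec fonction mémoires"
LONG_SAMPLE = "xation de 2 sièges-enfants sur la banquette AR),Pack \"Assistance\",Keyless Access avec alarme : Système de verrouillage/déverrouillage et de démarrage sans clé,Park Assist: Système d'assistance au stationnement en créneauet et en bataille,Rear Assist: Caméra de recul avec visualisation de la zone situ"

# A string that contains the full phrase is matched as a whole.
SHORT_MATCH = SHORT_SAMPLE.match(WORKAROUND_REGEX)
puts SHORT_MATCH[1]

# The question's text does not contain the phrase, so the result is nil.
LONG_MATCH = LONG_SAMPLE.match(WORKAROUND_REGEX)
puts LONG_MATCH.inspect

# Caveat: \W (ASCII-based) matches "é", \P{Word} (Unicode-based) does not.
puts "é".match?(/\W/)
puts "é".match?(/\P{Word}/)
```

If your data has accented letters between "sièges" and "avec", the class may need to be widened (for example with \p{L}) to keep the original matching behaviour; that is an assumption to verify, not something tested in the answer.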

    Historically, when modern engines were first designed, a negated class [^...] was
    implemented so that each item is AND NOT, while in a positive class each item is ORed.
    Combining the two in one class resulted in scoping errors.
    Perl still had character class errors until fairly recently.
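
On the second half of the question (guarding without Timeout): Ruby 3.2 and later accept a timeout: keyword in Regexp.new and raise Regexp::TimeoutError when a match exceeds it. That does not exist on 2.7.2, so the sketch below is only useful after an upgrade. build_guarded_regex is a hypothetical helper name, the 1.0 second limit is arbitrary, and whether the timeout would interrupt this particular hang has not been tested here.

```ruby
# Sketch only: build_guarded_regex is a hypothetical helper, not part of Ruby.
# The timeout: keyword and Regexp::TimeoutError exist only in Ruby 3.2+.
def build_guarded_regex(source, seconds: 1.0)
  if defined?(Regexp::TimeoutError)
    # Ruby 3.2+: a match running longer than `seconds` raises Regexp::TimeoutError.
    Regexp.new(source, Regexp::IGNORECASE, timeout: seconds)
  else
    # Older Rubies (including 2.7.2): no per-regexp timeout is available.
    Regexp.new(source, Regexp::IGNORECASE)
  end
end

GUARDED_REGEX = build_guarded_regex('(Si.ges[a-zA-Z\P{Word}]*avec\W*fonction\W*m.moires)')

GUARDED_RESULT = GUARDED_REGEX.match("Sièges avant avec fonction mémoires")
puts GUARDED_RESULT ? GUARDED_RESULT[1] : "no match"
```

A caller on 3.2+ would wrap the match in begin/rescue Regexp::TimeoutError and decide how to handle the failure, for example by logging the input and skipping it.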