Search code examples
regexpcre

Why no backtracing when using star with negated character set


Actually i understand what is backtracking, which is the state when the engine applies a greedy quantifier, which result in other atom failure so the engine starts to backtrack to the previous state to give up matchings gradually in the sake of matching remaining atoms.

but i got unexpected behavior when used this pattern "[^"]*" on this "abcv, which i wrote it to check what happens on failure. and i expected the engine to take these step:

  • the engine matches "
  • then the greedily quantified negated character set will match abcv
  • the engine fails to match the last "
  • so it should backtrack to the [^"]* to give up characters one by one trying to match the remaining atom.

but when i test this on regex101, the engine doesn't backtrack, but it starts over from another position each time it fails. so what am i missing here?

Is this what exactly been expected? if so, would any one explain why?

Update

I need to mention that ".*" backtracks and if you check the engine steps you will find that it starts giving characters one by one but the one with issue doesn't. why this difference while both .* and [^"]* are greedy quantifiers which matched the same text, but one had to back track, the other hadn't.


Solution

  • PCRE is using the "auto-possessification" optimization here as it "sees" that there is no way to match any other chars other than " between two ". See PCRE docs:

    PCRE_NO_AUTO_POSSESS

      If  this option is set, it disables "auto-possessification". This is an
      optimization that, for example, turns a+b into a++b in order  to  avoid
      backtracks  into  a+ that can never be successful. However, if callouts
      are in use, auto-possessification means that some  of  them  are  never
      taken. You can set this option if you want the matching functions to do
      a full unoptimized search and run all the callouts, but  it  is  mainly
      provided for testing purposes.
    

    You may easily check that by prepending "[^"]*" with the (*NO_AUTO_POSSESS) PCRE verb:

    enter image description here