javascript, regex, regex-group

Behaviour of the ? quantifier when applied to (\b) in JavaScript regex


I have a tiny regex: foo(\b)?. It was meant as an experiment to see whether I can deduce the existence of the boundary just by checking whether the first group participated in the match (yielding an empty string) or not.

I tried this in several languages (PHP, Python, Java, C#, and Rust), matching against foo food. All of them behave as expected: group 1 is an empty string for the first match and null/None/nothing for the second.
I couldn't figure out how to write a proper snippet in Go or C++, but regex101 says Go behaves the same way; I'm unsure about C++.

However, this is not the case with JS, as it outputs undefined for group 1 in both matches against foo food.

console.log(...'foo food'.matchAll(/foo(\b)?/g));

Yet, (\b) without ? does capture an empty string.

console.log(...'foo food'.matchAll(/foo(\b)/g));

Considering that ? is greedy, shouldn't (\b) always match and capture an empty string after the first foo, as with other languages? What are the alternatives?

I can reproduce this in Node.js and Chrome (both V8) as well as Firefox (Gecko), so this is probably a quirk rather than a bug.
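For reference, here is a minimal sketch that puts the two patterns from above side by side. The helper name groupOneValues is hypothetical and exists only for this comparison; the commented results are the ones described in the question.

```javascript
// Hypothetical helper (not part of any library): returns the value of
// capture group 1 for every match of `pattern` in `input`.
function groupOneValues(pattern, input) {
  return [...input.matchAll(pattern)].map((match) => match[1]);
}

// Optional group: both occurrences of "foo" match, but group 1 never participates.
const withQuantifier = groupOneValues(/foo(\b)?/g, 'foo food');

// Mandatory group: only the first "foo" is followed by a boundary, so there is one match.
const withoutQuantifier = groupOneValues(/foo(\b)/g, 'foo food');

console.log(withQuantifier);    // [ undefined, undefined ]
console.log(withoutQuantifier); // [ '' ]
```

Note that the second pattern yields a single match, because the foo inside food is not followed by a word boundary and the mandatory group cannot be skipped.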


Solution

  • As discussed in both the question and the comments, this is a quirk. I don't know why or how it happens, but I have found an alternative: foo(?:(\b)|). Group 1 is an empty string if the first branch matched and undefined otherwise, which effectively sidesteps this strange behaviour of ?.

    [...'foo food'.matchAll(/foo(?:(\b)|)/g)]
    
    // [0: 'foo', 1: '']
    // [0: 'foo', 1: undefined]
    

    Try it on regex101.com.

    Try it:

    console.log(...'foo food'.matchAll(/foo(?:(\b)|)/g));

    An empty branch is most often seen as a non-recommended version of ?,[citation needed] but it seems that the two differ after all, at least in ECMAScript.
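To tie this back to the original goal of deducing the boundary from group 1, the workaround can be wrapped in a small function. This is only a sketch under stated assumptions: fooBoundaryFlags and the shape of the objects it returns are hypothetical names chosen for illustration, and the pattern is hard-coded to foo as in the question.

```javascript
// Hypothetical helper: for each occurrence of "foo" in `input`, report where it
// starts and whether it is immediately followed by a word boundary.
function fooBoundaryFlags(input) {
  // Group 1 participates (as '') only when the \b branch matched;
  // when the empty branch is taken instead, group 1 stays undefined.
  return [...input.matchAll(/foo(?:(\b)|)/g)].map((match) => ({
    index: match.index,
    atBoundary: match[1] !== undefined,
  }));
}

const flags = fooBoundaryFlags('foo food');
console.log(flags);
// [ { index: 0, atBoundary: true }, { index: 4, atBoundary: false } ]
```

Comparing against undefined rather than testing truthiness matters here, since the empty string captured by (\b) is itself falsy.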