Tags: c++, regex, tokenize, lexer, capturing-group

C++ regex: Get index of the Capture Group the SubMatch matched to


Context. I'm developing a Lexer/Tokenizing engine, which would use regex as a backend. The lexer accepts rules, which define the token types/IDs, e.g.

<identifier> = "\\b\\w+\\b".

As I envision, to do the regex match-based tokenizing, all of the rules defined by regexes are enclosed in capturing groups, and all groups are separated by ORs.

When the matching is being executed, every match we produce must have an index of the capturing group it was matched to. We use these IDs to map the matches to token types.
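
A minimal sketch of that setup, assuming a hypothetical `Rule` struct and `combine_rules` helper (neither name comes from an existing library). It only illustrates how the rules could be wrapped in capturing groups and joined with ORs, and it assumes the individual rule patterns contain no capturing groups of their own:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical rule type: a token name plus the regex that recognizes it.
struct Rule {
    std::string name;
    std::string pattern;
};

// Wrap each rule in its own capturing group and join the groups with '|'.
// Assumption: rule patterns contain no capturing groups of their own,
// otherwise group numbers would no longer line up with rule order.
std::string combine_rules(const std::vector<Rule>& rules) {
    std::string combined;
    for (std::size_t i = 0; i < rules.size(); ++i) {
        if (i != 0) combined += '|';
        combined += '(' + rules[i].pattern + ')';
    }
    return combined;
}
```

With this layout, rule number i in the vector corresponds to capturing group i + 1 of the combined pattern.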

So the question is: how do I get the ID of the group?

Similar question here, but it does not provide the solution to my specific problem.

Exactly my problem here, but it's in JS, and I need a C/C++ solution.

So let's say I've got a regex, made up of capturing groups separated by an OR:

(\\b[a-zA-Z]+\\b)|(\\b\\d+\\b)

which matches whole numbers or alphabetic words.

I need to know the index of the capture group that each match came from. For example, when matching the string

foo bar 123

three iterations are performed. The group indexes for those iterations would be 0 0 1, because the first two matches come from the first capturing group and the last match comes from the second capturing group.

I know that in the standard std::regex library it's not entirely possible (regex_token_iterator is not a solution, because I don't need to skip any matches).

I don't have much knowledge about boost::regex or PCRE regex library.

What is the best way to accomplish this task? Which is the library and method to use?


Solution

  • You can use std::sregex_iterator to iterate over all matches. For each match, inspect the std::match_results object and find the group that participated in the match by checking m[index].matched (only one group can match here, either the first or the second). Subtracting 1 from that group's index gives the zero-based ID:

    #include <cstddef>
    #include <iostream>
    #include <regex>
    #include <string>

    int main()
    {
        std::regex r(R"((\b[[:alpha:]]+\b)|(\b\d+\b))");
        std::string s = "foo bar 123";
        for (std::sregex_iterator i(s.begin(), s.end(), r), end; i != end; ++i)
        {
            const std::smatch& m = *i;
            std::cout << "Match value: " << m.str() << " at Position " << m.position() << '\n';

            // Group 0 is the whole match, so start at 1 and report a zero-based ID.
            for (std::size_t index = 1; index < m.size(); ++index) {
                if (m[index].matched) {
                    std::cout << "Capture group ID: " << index - 1 << '\n';
                    break;
                }
            }
        }
    }


    Output:

    Match value: foo at Position 0
    Capture group ID: 0
    Match value: bar at Position 4
    Capture group ID: 0
    Match value: 123 at Position 8
    Capture group ID: 1
    

    Note that R"(...)" is a raw string literal, so there is no need to double the backslashes inside it.

    Also, index starts at 1 in the for loop because group 0 is the whole match. Since you want zero-based group IDs, 1 is subtracted when the ID is printed.
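
    For the lexer use case, the same loop could be wrapped in a function that returns the group ID together with the matched text. The following is only a sketch: the `Token` struct and the `tokenize` function are hypothetical names introduced for illustration, and it assumes each rule is exactly one top-level capturing group with no nested groups:

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical token record: which rule matched, and the matched text.
struct Token {
    std::size_t rule;  // zero-based index of the capture group / rule
    std::string text;
};

// Runs the combined regex over the input and records, for each match,
// the first capture group that participated in it.
std::vector<Token> tokenize(const std::string& input, const std::regex& re) {
    std::vector<Token> tokens;
    for (std::sregex_iterator it(input.begin(), input.end(), re), end; it != end; ++it) {
        const std::smatch& m = *it;
        // Group 0 is the whole match, so the rule groups start at 1.
        for (std::size_t g = 1; g < m.size(); ++g) {
            if (m[g].matched) {
                tokens.push_back({g - 1, m.str()});
                break;
            }
        }
    }
    return tokens;
}
```

    The `rule` field can then be used as an index into whatever table maps rules to token types. Text that no rule matches is silently skipped by the iterator, so a real lexer would probably also want to compare m.position() against the end of the previous match to detect gaps.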