c++ regex pcre2

How to retrieve the captured substrings from a capturing group that may repeat?


I find this question difficult to put into words, so let's go directly to a simple example.

Assume we have a subject string "apple:banana:cherry:durian". We want to match the subject and have $1, $2, $3 and $4 become "apple", "banana", "cherry" and "durian", respectively. The pattern I'm using is ^(\w+)(?::(.*?))*$, and $1 will be "apple" as expected. However, $2 will be "durian" instead of "banana".

The subject string does not have to contain exactly four items. For example, it could be "one:two:three", in which case $1 and $2 will be "one" and "three" respectively. Again, the middle item is missing.

What is the correct pattern to use in this case? By the way, I'm going to use PCRE2 from C++ code, so Perl's built-in split function is not available. Thanks.


Solution

  • If the input contains strictly items of interest separated by :, like item1:item2:item3, as the attempt in the question indicates, then you can use the regex pattern

    [^:]+
    

    which matches a run of consecutive characters that are not :, that is, one item up to the next : or the end of the string. It may need to be captured as well, ([^:]+), depending on the overall approach. How to use this to get all such matches depends on the language.

    In C++ there are different ways to approach this. One uses std::regex_iterator (here through its std::sregex_iterator alias):

    #include <string>
    #include <vector>
    #include <iterator>
    #include <regex>
    #include <iostream>
    
    int main()
    {
        std::string str{R"(one:two:three)"};
        std::regex r{R"([^:]+)"};
    
        std::vector<std::string> result{};
    
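        // Each dereference of the iterator yields the next non-overlapping match.
        auto it = std::sregex_iterator(str.begin(), str.end(), r);
        auto end = std::sregex_iterator();  // default-constructed: end of sequence
        for(; it != end; ++it) {
            const auto& match = *it;
            result.push_back(match[0].str());  // [0] is the whole match
        }
    
        std::cout << "Input string: " << str << '\n';
        for(const auto& i : result)
            std::cout << i << '\n';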
    }
    

    This prints the input string followed by each item on its own line, as expected.
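
    Another standard-library option, shown here only as a minimal sketch, is std::sregex_token_iterator, which hands back the matched substrings directly, so the loop above collapses into a range construction. The helper name split_items is hypothetical and not part of the original answer.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper (the name is an assumption for illustration):
// collects every run of non-':' characters found in `input`.
std::vector<std::string> split_items(const std::string& input)
{
    static const std::regex item{R"([^:]+)"};

    // Submatch index 0 selects the whole match for each item.
    std::sregex_token_iterator first(input.begin(), input.end(), item, 0);
    std::sregex_token_iterator last;  // default-constructed: end of sequence

    // Each dereferenced sub_match converts to std::string.
    return std::vector<std::string>(first, last);
}
```

    Calling split_items("one:two:three") would give a vector holding one, two and three. Note that this only extracts items; it does not check that the input as a whole is well formed.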

    One can also use std::regex_search, even though it returns at the first match, by moving the search start past each match in a loop:

    #include <string>
    #include <regex>
    #include <iostream>
    
    int main()
    {
        std::string str{"one:two:three"};
        std::regex r{"[^:]+"};
    
        std::smatch res;
    
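        // Start at the beginning; after each match, resume right behind it.
        std::string::const_iterator search_beg( str.cbegin() );
        while ( std::regex_search( search_beg, str.cend(), res, r ) )
        {
            std::cout << res[0] << '\n';
            search_beg = res.suffix().first;  // first character after this match
        }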
        std::cout << '\n';
    }
    

    (With this string and regex we don't need raw string literals, so I've removed them here.)
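
    The question also wants the subject matched as a whole. A repeated capturing group keeps only the text of its last iteration, which is why $2 ended up as "durian", so one possible approach is to validate the whole string with one pattern and then extract the items with a second one. The following is a sketch under assumptions: the helper name parse_items is hypothetical, and it assumes every item consists of \w characters, as the first group in the question's pattern suggests.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: fills `out` and returns true only if the whole
// subject has the form item(:item)*, where an item is one or more \w
// characters (an assumption based on the pattern in the question).
bool parse_items(const std::string& subject, std::vector<std::string>& out)
{
    static const std::regex whole{R"(^\w+(?::\w+)*$)"};
    static const std::regex item{R"(\w+)"};

    out.clear();

    // Step 1: check the overall shape of the subject.
    if (!std::regex_match(subject, whole))
        return false;

    // Step 2: walk over the individual items.
    for (std::sregex_iterator it(subject.begin(), subject.end(), item), end;
         it != end; ++it)
        out.push_back(it->str());

    return true;
}
```

    With "apple:banana:cherry:durian" this would fill out with the four items, while an input such as "apple::cherry" is rejected by the first step. The same two-step idea should carry over to other regex libraries, with the iteration done by whatever repeated-match mechanism the library offers.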


    This question was initially tagged perl (and not c++), and its text mentions Perl as well, so the original version of this answer used Perl:

    /([^:]+)/g
    

    The // are pattern delimiters. The /g "modifier" is for "global," to find all matches.

    When this expression is bound (=~) to a variable holding the target string, to a string literal, or to an expression yielding a scalar, the whole expression returns a list of matches if it is used in a context where a list is expected. It can therefore be assigned directly to an array variable, where the list assignment itself provides the context:

    my @captures = $string =~ /[^:]+/g;
    

    (When this is used literally as shown, the capturing () aren't needed.)

    Assigning to an array provides this "list context." If the matching is used in a "scalar context," in which a single value is expected, like in the condition for an if test or being assigned to a scalar variable, then a single true/false is returned (usually 1 or '', empty string).