
boost regex iterator returning empty string


I am a beginner with regex in C++. I was wondering why this code:

#include <iostream>
#include <string>
#include <boost/regex.hpp>

int main() {

   std::string s = "? 8==2 : true ! false";
   boost::regex re("\\?\\s+(.*)\\s*:\\s*(.*)\\s*\\!\\s*(.*)");

   boost::sregex_token_iterator p(s.begin(), s.end(), re, -1);  // iterate over the sequence with the regex
   boost::sregex_token_iterator end;    // default-constructed end-of-sequence
                                        // marker
   while (p != end)
      std::cout << *p++ << '\n';
}

prints an empty string. I put the regex in regexTester and it matches the string correctly, but when I try to iterate over the matches here it returns nothing.


Solution

  • I think the tokenizer is actually meant to split text by some delimiter, and the delimiter is not included. Compare with std::regex_token_iterator:

    std::regex_token_iterator is a read-only LegacyForwardIterator that accesses the individual sub-matches of every match of a regular expression within the underlying character sequence. It can also be used to access the parts of the sequence that were not matched by the given regular expression (e.g. as a tokenizer).

    Indeed you invoke exactly this mode as per the docs:

    if submatch is -1, then enumerates all the text sequences that did not match the expression re (that is to performs field splitting).

    (emphasis mine).

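    So, just fix that by dropping the -1 argument. The default submatch, 0, makes the iterator enumerate the matches themselves (It below stands for std::string::const_iterator):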

    for (boost::sregex_token_iterator p(s.begin(), s.end(), re), e; p != e;
         ++p)
    {
        boost::sub_match<It> const& current = *p;
        if (current.matched) {
            std::cout << std::quoted(current.str()) << '\n';
        } else {
            std::cout << "non matching" << '\n';
        }
    }
    

    Other Observations

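    All the greedy Kleene-stars are a recipe for trouble. You won't ever find a second match, because the first one's .* at the end will by definition gobble up all remaining input.

    Instead, make them non-greedy (.*?) and/or much more precise (like isolating some character set, or mandating non-space characters?). Note that a non-greedy (.*?) at the very end of the pattern is satisfied by the empty string, which is why the matches in the demo output below stop right after the "!".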

    boost::regex re(R"(\?\s+(.*?)\s*:\s*(.*?)\s*\!\s*(.*?))");
    
    // Or, if you don't want raw string literals:
    boost::regex re("\\?\\s+(.*?)\\s*:\\s*(.*?)\\s*\\!\\s*(.*?)");
    

    Live Demo

    #include <boost/regex.hpp>
    #include <iomanip>
    #include <iostream>
    #include <string>
    
    int main() {
        using It = std::string::const_iterator;
        std::string const s = 
            "? 8==2 : true ! false;"
            "? 9==3 : 'book' ! 'library';";
        boost::regex re(R"(\?\s+(.*?)\s*:\s*(.*?)\s*\!\s*(.*?))");
    
        {
            std::cout << "=== regex_search:\n";
            boost::smatch results;
            for (It b = s.begin(); boost::regex_search(b, s.end(), results, re); b = results[0].end()) {
                std::cout << results.str() << "\n";
                std::cout << "remain: " << std::quoted(std::string(results[0].second, s.end())) << "\n";
            }
        }
    
        std::cout << "=== token iteration:\n";
        for (boost::sregex_token_iterator p(s.begin(), s.end(), re), e; p != e;
             ++p)
        {
            boost::sub_match<It> const& current = *p;
            if (current.matched) {
                std::cout << std::quoted(current.str()) << '\n';
            } else {
                std::cout << "non matching" << '\n';
            }
        }
    }
    

    Prints

    === regex_search:
    ? 8==2 : true ! 
    remain: "false;? 9==3 : 'book' ! 'library';"
    ? 9==3 : 'book' ! 
    remain: "'library';"
    === token iteration:
    "? 8==2 : true ! "
    "? 9==3 : 'book' ! "
    

    BONUS: Parser Expressions

    Instead of abusing regexen to do parsing, you could generate a parser, e.g. using Boost Spirit:

    Live On Coliru

    #include <boost/spirit/home/x3.hpp>
    #include <boost/fusion/adapted.hpp>
    #include <iomanip>
    #include <iostream>
    #include <string>
    #include <tuple>
    #include <vector>
    namespace x3 = boost::spirit::x3;
    
    int main() {
        std::string const s = 
            "? 8==2 : true ! false;"
            "? 9==3 : 'book' ! 'library';";
    
        using expression = std::string;
        using ternary = std::tuple<expression, expression, expression>;
        std::vector<ternary> parsed;
    
        auto expr_ = x3::lexeme [+(x3::graph - ';')];
        auto ternary_ = "?" >> expr_ >> ":" >> expr_ >> "!" >> expr_;
    
        std::cout << "=== parser approach:\n";
        if (x3::phrase_parse(begin(s), end(s), *x3::seek[ ternary_ ], x3::space, parsed)) {
    
            for (auto [cond, e1, e2] : parsed) {
                std::cout
                    << " condition " << std::quoted(cond) << "\n"
                    << " true expression " << std::quoted(e1) << "\n"
                    << " else expression " << std::quoted(e2) << "\n"
                    << "\n";
            }
        } else {
            std::cout << "non matching" << '\n';
        }
    }
    

    Prints

    === parser approach:
     condition "8==2"
     true expression "true"
     else expression "false"
    
     condition "9==3"
     true expression "'book'"
     else expression "'library'"
    

    This is much more extensible, will easily support recursive grammars and will be able to synthesize a typed representation of your syntax tree, instead of just leaving you with scattered bits of string.