Search code examples
c++regexboost-spiritbacktrackingqi

How to write a boost::spirit::qi parser to do what '?' does in regex?


Let's say we have a regex "start:(?: ([0-9]{1,2}))? ([0-9].*)".

It will match

std::string string1 = "start: 01 0ab";

and

std::string string2 = "start: 0ab";

We can also get the 2 matched string respectively.

I try to use boost::spirit::qi parser to parse string2 but it couldn't match.

qi::rule<std::string::const_iterator, std::string()> rule1 = qi::repeat(1,2)[qi::digit];
qi::rule<std::string::const_iterator, std::string()> rule2 = qi::digit >> *qi::char_;
std::vector<std::string> attr;
auto it_begin = string2.begin();
auto it_end = string2.end();
if (qi::parse(
    it_begin,
    it_end,
    qi::lit("start:")
         >> -(qi::lit(" ") >> rule1)
         >> qi::lit(" ") >> rule2
         >> qi::eoi,
    attr))
    std::cout<<"match"<<std::endl;
else
    std::cout<<"not match"<<std::endl;

We can of course use a look-ahead operator to check what's behind rule1, but is there a more generic approach to implement regex operator '?' ? Thanks!


Solution

  • I'm not sure what's wrong with the expectation. It is the only way for otherwise ambiguous rules, since PEG grammars are always greedy.

    However, maybe you didn't arrive at the most elegant form, since you were looking for something "better". Here's what I'd do.

    I'd use a skipper to match spaces¹:

        if (qi::phrase_parse(it_begin, it_end,
                    "start:" >> -rule1 >> rule2 >> qi::eoi,
                    qi::space, attr))
    

    Where the rules are still lexemes (because there were declared without the skipper):

    qi::rule<It, std::string()> const 
        rule1 = qi::digit >> qi::digit >> &qi::space,
        rule2 = qi::digit >> *qi::graph;
    

    Note qi::graph doesn't match whitespace, where *qi::char_ simply matches anything at all greedily.

    Live On Coliru

    #include <boost/spirit/include/qi.hpp>
    namespace qi = boost::spirit::qi;
    
    int main() {
        using It = std::string::const_iterator;
    
        // implicitly lexemes (no skipper in rule declaration)
        qi::rule<It, std::string()> const 
            rule1 = qi::digit >> qi::digit >> &qi::space,
            rule2 = qi::digit >> *qi::graph;
    
        for (std::string const input : { "start: 01 0ab", "start: 0ab", }) {
            std::vector<std::string> attr;
    
            auto it_begin = input.begin();
            auto it_end   = input.end();
    
            if (qi::phrase_parse(it_begin, it_end, "start:" >> -rule1 >> rule2 >> qi::eoi, qi::space, attr))
                std::cout << "match\n";
            else
                std::cout << "not match\n";
    
            if (it_begin!=it_end)
                std::cout<<"Remaining unparsed input: '" << std::string(it_begin, it_end) << "'\n";
        }
    }
    

    Prints

    match
    match
    

    ¹ this assumes that multiple/different whitespace is okay. If newlines should not count as whitespace, use qi::blank instead of qi::space