Misunderstanding repeat directive - it should fail, but doesn't


I would like to write a grammar (highly simplified) with:

grr ::= integer [ . integer ]

with

integer ::= digit { [ underline ] digit }

Since the parsed literals are needed again later (the real grammar is more complex, and not everything can be converted to a number immediately), each literal must be stored in the AST in full, underlines included, as a string (more precisely, as an iterator_range).

The problem now is that a literal can be longer than the later implementation (computation etc.) can handle. The obvious solution is the repeat directive (documented in detail for Qi, and only very briefly for X3).

This is where my problems start (coliru):

    for(std::string_view const s : {
        // ok
        "0", "10", "1_0", "012345", 
        // too long
        "0123456",
        "1_2_3_4_5_6_7_8_9_0", 
        // absolutely invalid
        "1_2_3_4_5_6_", "_0123_456", ""
    }) {
        auto const cs = x3::char_("0-9");
        std::string attr;
        bool const ok = x3::parse(std::begin(s), std::end(s), 
            x3::raw[ cs >> x3::repeat(0, 5)[ ('_' >> cs) | cs] ],
            attr);
        std::cout << s << " -> " << attr
             << " (" << std::boolalpha << ok << ")"
             << "\n";   
    }

gives

0 -> 0 (true)
10 -> 10 (true)
1_0 -> 1_0 (true)
012345 -> 012345 (true)
0123456 -> 012345 (true)
1_2_3_4_5_6_7_8_9_0 -> 1_2_3_4_5_6 (true)
1_2_3_4_5_6_ -> 1_2_3_4_5_6 (true)
_0123_456 ->  (false)
 ->  (false)

If the literal is too long, the parser should fail, but it does not. If the literal ends with an underline, it should fail as well, but it doesn't. A leading underline and an empty literal are correctly rejected (parsed as false).

Meanwhile, I am trying to move the more complex parsers into separate parser classes, but there I am missing, for example, a rule that rejects a literal ending with an underline.

Furthermore, BOOST_SPIRIT_X3_DEBUG seems to be broken all of a sudden - there is no output.

What is the solution to my problem? I'm out of ideas, apart from a low-level and complicated approach with iterators, counters, etc.

This problem also affects other rules to be implemented.


Solution

  • If the literal is too long, the parser should fail

    Where does it say that? The code does exactly what you ask: it parses at most 6 digits, with the requisite underscores, and then stops. Nothing in the grammar requires the input to end there, so the parse succeeds on a matching prefix. The output confirms exactly that.

    You can of course make it much more apparent by showing what was not parsed:


    auto f = begin(s), l = end(s);
    bool const ok = x3::parse(
        f, l, x3::raw[cs >> x3::repeat(0, 5)[('_' >> cs) | cs]], attr);
    
    fmt::print(
        "{:21} -> {:5} {:13} remaining '{}'\n",
        fmt::format("'{}'", s),
        ok,
        fmt::format("'{}'", attr),
        std::string(f, l));
    

    Prints

    '0'                   -> true  '0'           remaining ''
    '10'                  -> true  '10'          remaining ''
    '1_0'                 -> true  '1_0'         remaining ''
    '012345'              -> true  '012345'      remaining ''
    '0123456'             -> true  '012345'      remaining '6'
    '1_2_3_4_5_6_7_8_9_0' -> true  '1_2_3_4_5_6' remaining '_7_8_9_0'
    '1_2_3_4_5_6_'        -> true  '1_2_3_4_5_6' remaining '_'
    '_0123_456'           -> false ''            remaining '_0123_456'
    ''                    -> false ''            remaining ''
    

    Fixes

    To assert that the complete input is parsed, either append x3::eoi to the grammar or check the iterators afterwards:


    bool const ok = x3::parse(
        f,
        l,
        x3::raw[cs >> x3::repeat(0, 5)[('_' >> cs) | cs]] >> x3::eoi,
        attr);
    

    Prints

    '0'                   -> true  '0'           remaining ''
    '10'                  -> true  '10'          remaining ''
    '1_0'                 -> true  '1_0'         remaining ''
    '012345'              -> true  '012345'      remaining ''
    '0123456'             -> false '012345'      remaining '0123456'
    '1_2_3_4_5_6_7_8_9_0' -> false '1_2_3_4_5_6' remaining '1_2_3_4_5_6_7_8_9_0'
    '1_2_3_4_5_6_'        -> false '1_2_3_4_5_6' remaining '1_2_3_4_5_6_'
    '_0123_456'           -> false ''            remaining '_0123_456'
    ''                    -> false ''            remaining ''
    

    Distinct Lexemes

    If instead you want to allow the input to continue, just not with certain characters (e.g. when parsing many such "numbers"), add a negative look-ahead:

    auto const number = x3::lexeme[ //
        x3::raw[cs >> x3::repeat(0, 5)[('_' >> cs) | cs]]
        // within the lexeme, assert that no digit or _ follows
        >> ! (cs | '_') //
    ];
    


    //#define BOOST_SPIRIT_X3_DEBUG
    #include <boost/spirit/home/x3.hpp>
    #include <fmt/ranges.h>
    #include <string>
    #include <string_view>
    #include <vector>
    using namespace std::string_view_literals;
    
    namespace Parser {
        namespace x3 = boost::spirit::x3;
        auto const cs = x3::digit;
        auto const number = x3::lexeme[ //
            x3::raw[cs >> x3::repeat(0, 5)[('_' >> cs) | cs]]
            // within the lexeme, assert that no digit or _ follows
            >> ! (cs | '_') //
        ];
        auto const ws_or_comment = x3::space | "//" >> *~x3::char_("\r\n");
        auto const numbers = x3::skip(ws_or_comment)[number % ','];
    } // namespace Parser
    
    int main()
    {
        std::vector<std::string> attr;
        std::string_view const s =
            R"(0,
               10,
               1_0,
               012345,
               // too long
               0123456,
               1_2_3_4_5_6_7_8_9_0,
               // absolutely invalid
               1_2_3_4_5_6_,
               _0123_456)"sv;
    
        auto f = begin(s), l = end(s);
        bool const ok = parse(f, l, Parser::numbers, attr);
    
        fmt::print("{}: {}\nremaining '{}'\n", ok, attr, std::string(f, l));
    }
    

    Prints

    true: ["0", "10", "1_0", "012345"]
    remaining ',
               // too long
               0123456,
               1_2_3_4_5_6_7_8_9_0,
               // absolutely invalid
               1_2_3_4_5_6_,
               _0123_456'
    

    Proving It

    To drive home why the check belongs inside the lexeme when otherwise insignificant whitespace separates the numbers, drop the commas from the grammar:

    auto const numbers = x3::skip(ws_or_comment)[*number];
    

    With a slightly adjusted test input (removing the commas):


    //#define BOOST_SPIRIT_X3_DEBUG
    #include <boost/spirit/home/x3.hpp>
    #include <fmt/ranges.h>
    #include <string>
    #include <string_view>
    #include <vector>
    using namespace std::string_view_literals;
    
    namespace Parser {
        namespace x3 = boost::spirit::x3;
        auto const cs = x3::digit;
        auto const number = x3::lexeme[ //
            x3::raw[cs >> x3::repeat(0, 5)[('_' >> cs) | cs]]
            // within the lexeme, assert that no digit or _ follows
            >> ! (cs | '_') //
        ];
        auto const ws_or_comment = x3::space | "//" >> *~x3::char_("\r\n");
        auto const numbers = x3::skip(ws_or_comment)[*number];
    } // namespace Parser
    
    int main()
    {
        std::vector<std::string> attr;
        std::string_view const s =
            R"(0
               10
               1_0
               012345
               // too long
               0123456
               1_2_3_4_5_6_7_8_9_0
               // absolutely invalid
               1_2_3_4_5_6_
               _0123_456)"sv;
    
        auto f = begin(s), l = end(s);
        bool const ok = parse(f, l, Parser::numbers, attr);
    
        fmt::print("{}: {}\nremaining '{}'\n", ok, attr, std::string(f, l));
    }
    

    Prints

    true: ["0", "10", "1_0", "012345"]
    remaining '0123456
               1_2_3_4_5_6_7_8_9_0
               // absolutely invalid
               1_2_3_4_5_6_
               _0123_456'