I would like to write a grammar (highly simplified) with:
grr ::= integer [ . integer ]
with
integer ::= digit { [ underline ] digit }
Since the parsed literals are needed again later (the real grammar is more complex, and not everything can be converted to a number immediately), the literal must be stored in the AST in full as a string (more precisely, as an iterator_range), underscores included, for later use.
The problem now is that a literal can be longer than the later implementation/computation can handle. The obvious solution is the repeat directive (documented in detail for Qi's repeat, and only very briefly for X3).
This is where my problems start (coliru):
for (std::string_view const s : {
         // ok
         "0", "10", "1_0", "012345",
         // too long
         "0123456",
         "1_2_3_4_5_6_7_8_9_0",
         // absolutely invalid
         "1_2_3_4_5_6_", "_0123_456", ""
     }) {
    auto const cs = x3::char_("0-9");
    std::string attr;
    bool const ok = x3::parse(std::begin(s), std::end(s),
        x3::raw[ cs >> x3::repeat(0, 5)[ ('_' >> cs) | cs] ],
        attr);
    std::cout << s << " -> " << attr
              << " (" << std::boolalpha << ok << ")"
              << "\n";
}
gives
0 -> 0 (true)
10 -> 10 (true)
1_0 -> 1_0 (true)
012345 -> 012345 (true)
0123456 -> 012345 (true)
1_2_3_4_5_6_7_8_9_0 -> 1_2_3_4_5_6 (true)
1_2_3_4_5_6_ -> 1_2_3_4_5_6 (true)
_0123_456 -> (false)
-> (false)
If the literal is too long, the parser should fail, which it does not. If it ends with an underscore, it should fail as well, but it doesn't. A leading underscore and an empty literal are correctly rejected.
Meanwhile, I am trying to move the more complex parsers into separate parser classes, but here I am missing, for example, a rule to recognize a literal ending with an underscore.
Furthermore, BOOST_SPIRIT_X3_DEBUG seems to be broken all of a sudden: there is no output.
What is the solution to my problem? I'm out of ideas, apart from a low-level and complicated approach using iterators, counters, etc.
This problem also affects other rules that still have to be implemented.
"If the literal is too long, the parser should fail"
Where does it say that? The code does exactly what you ask: it parses at most 6 digits, with the requisite underscores. x3::repeat(0, 5) only caps how many repetitions are consumed; it says nothing about what may follow. The output even confirms this.
You can of course make it much more apparent by showing what was not parsed:
auto f = begin(s), l = end(s);
bool const ok = x3::parse(
    f, l, x3::raw[cs >> x3::repeat(0, 5)[('_' >> cs) | cs]], attr);
fmt::print(
    "{:21} -> {:5} {:13} remaining '{}'\n",
    fmt::format("'{}'", s),
    ok,
    fmt::format("'{}'", attr),
    std::string(f, l));
Prints
'0' -> true '0' remaining ''
'10' -> true '10' remaining ''
'1_0' -> true '1_0' remaining ''
'012345' -> true '012345' remaining ''
'0123456' -> true '012345' remaining '6'
'1_2_3_4_5_6_7_8_9_0' -> true '1_2_3_4_5_6' remaining '_7_8_9_0'
'1_2_3_4_5_6_' -> true '1_2_3_4_5_6' remaining '_'
'_0123_456' -> false '' remaining '_0123_456'
'' -> false '' remaining ''
To assert that the complete input is parsed, either append x3::eoi to the parser or check the iterators afterwards:
bool const ok = x3::parse(
    f,
    l,
    x3::raw[cs >> x3::repeat(0, 5)[('_' >> cs) | cs]] >> x3::eoi,
    attr);
Prints
'0' -> true '0' remaining ''
'10' -> true '10' remaining ''
'1_0' -> true '1_0' remaining ''
'012345' -> true '012345' remaining ''
'0123456' -> false '012345' remaining '0123456'
'1_2_3_4_5_6_7_8_9_0' -> false '1_2_3_4_5_6' remaining '1_2_3_4_5_6_7_8_9_0'
'1_2_3_4_5_6_' -> false '1_2_3_4_5_6' remaining '1_2_3_4_5_6_'
'_0123_456' -> false '' remaining '_0123_456'
'' -> false '' remaining ''
If instead you want to allow the input to continue, just not with certain characters (e.g. when parsing many such "numbers"), assert that with a negative lookahead:
auto const number = x3::lexeme[ //
    x3::raw[cs >> x3::repeat(0, 5)[('_' >> cs) | cs]]
    // within the lexeme, assert that no digit or _ follows
    >> ! (cs | '_') //
];
//#define BOOST_SPIRIT_X3_DEBUG
#include <boost/spirit/home/x3.hpp>
#include <fmt/ranges.h>
#include <string>
#include <string_view>
#include <vector>
using namespace std::string_view_literals;

namespace Parser {
    namespace x3 = boost::spirit::x3;

    auto const cs     = x3::digit;
    auto const number = x3::lexeme[ //
        x3::raw[cs >> x3::repeat(0, 5)[('_' >> cs) | cs]]
        // within the lexeme, assert that no digit or _ follows
        >> ! (cs | '_') //
    ];

    // skipper: whitespace, or a // comment running to the end of the line
    auto const ws_or_comment = x3::space | "//" >> *~x3::char_("\r\n");
    // comma-separated list of numbers
    auto const numbers       = x3::skip(ws_or_comment)[number % ','];
} // namespace Parser

int main()
{
    std::vector<std::string> attr;
    std::string_view const s =
        R"(0,
10,
1_0,
012345,
// too long
0123456,
1_2_3_4_5_6_7_8_9_0,
// absolutely invalid
1_2_3_4_5_6_,
_0123_456)"sv;

    auto f = begin(s), l = end(s);
    // parse is x3::parse, found through argument-dependent lookup
    bool const ok = parse(f, l, Parser::numbers, attr);
    fmt::print("{}: {}\nremaining '{}'\n", ok, attr, std::string(f, l));
}
Prints
true: ["0", "10", "1_0", "012345"]
remaining ',
// too long
0123456,
1_2_3_4_5_6_7_8_9_0,
// absolutely invalid
1_2_3_4_5_6_,
_0123_456'
To drive home the point of checking inside the lexeme in the presence of otherwise insignificant whitespace:
auto const numbers = x3::skip(ws_or_comment)[*number];
With a slightly adjusted test input (removing the commas):
//#define BOOST_SPIRIT_X3_DEBUG
#include <boost/spirit/home/x3.hpp>
#include <fmt/ranges.h>
#include <string>
#include <string_view>
#include <vector>
using namespace std::string_view_literals;

namespace Parser {
    namespace x3 = boost::spirit::x3;

    auto const cs     = x3::digit;
    auto const number = x3::lexeme[ //
        x3::raw[cs >> x3::repeat(0, 5)[('_' >> cs) | cs]]
        // within the lexeme, assert that no digit or _ follows
        >> ! (cs | '_') //
    ];

    // skipper: whitespace, or a // comment running to the end of the line
    auto const ws_or_comment = x3::space | "//" >> *~x3::char_("\r\n");
    // numbers separated only by skipped input (whitespace or comments)
    auto const numbers       = x3::skip(ws_or_comment)[*number];
} // namespace Parser

int main()
{
    std::vector<std::string> attr;
    std::string_view const s =
        R"(0
10
1_0
012345
// too long
0123456
1_2_3_4_5_6_7_8_9_0
// absolutely invalid
1_2_3_4_5_6_
_0123_456)"sv;

    auto f = begin(s), l = end(s);
    // parse is x3::parse, found through argument-dependent lookup
    bool const ok = parse(f, l, Parser::numbers, attr);
    fmt::print("{}: {}\nremaining '{}'\n", ok, attr, std::string(f, l));
}
Prints
true: ["0", "10", "1_0", "012345"]
remaining '0123456
1_2_3_4_5_6_7_8_9_0
// absolutely invalid
1_2_3_4_5_6_
_0123_456'