I'm trying to use Spirit to parse expressions of the form Module1.Module2.value
(any number of dot-separated capitalized identifiers, then a dot, then a lowercase OCaml-style identifier). My current definition of the parser looks like this:
using namespace boost::spirit::qi;
template <typename Iter=std::string::iterator>
struct value_path : grammar<Iter, boost::tuple<std::vector<std::string>, std::string>()> {
value_path() :
value_path::base_type(start)
{
start = -(module_path<Iter>() >> '.') >> value_name<Iter>();
}
rule<Iter, boost::tuple<std::vector<std::string>, std::string>()> start;
};
where module_path
and value_name
are similar template structs inherting from qi::grammar
with single start
field that is assigned some Spirit rule, possibly using other custom grammars (e.g. value_name
depends on lowercase_ident
and operator_name
which are defined analogously) in the constructor.
When attempting to parse_phrase()
with this grammar, the program segfaults somewhere in the internals of Spirit (according to gdb). The equivalent definition, where the constructor of value_path
is as follows (I've basically unrolled all custom grammars it depends on, leaving only builtin Spirit parsers, and attempted to make it readable, which in hindsight was a fool's errand):
start =
-((raw[upper >> *(alnum | char_('_') | char_('\''))] % '.') >> '.')
>> lexeme[((lower | char_('_')) >> *(alnum | char_('_') | char_('\'')))
| char_('(') >>
( ( (char_('!') >> *char_("-+!$%&*./:<=>?@^|~")
| (char_("~?") >> +char_("-+!$%&*./:<=>?@^|~"))
| ( (char_("-+=<>@^|&*/$%") >> *char_("-+!$%&*./:<=>?@^|~"))
| string("mod")
| string("lor")
| string("lsl")
| string("lsr")
| string("asr")
| string("or")
| string("-.")
| string("!=")
| string("||")
| string("&&")
| string(":=")
| char_("*+=<>&-")
)
) >> char_(')')
)
)
];
does not segfault, and appears to work correctly, however I would rather avoid something this verbose and unreadable in my code. It's also not extensible at all.
So far I've tried various combinations of .alias()
, as well as saving value_name<Iter>()
, module_path<Iter>()
and all intermediate grammars along the dependency chain into their own fields. Neither of those worked. How can I keep the high level of abstraction of the first example? Is there a standard way of composing grammars in Spirit that does not run into issues?
You're running into trouble because expression templates keep internal references to temporaries.
Simply aggregate the sub-parser instances:
template <typename Iter=std::string::iterator>
struct value_path : grammar<Iter, boost::tuple<std::vector<std::string>, std::string>()> {
value_path() : value_path::base_type(start)
{
start = -(module_path_ >> '.') >> value_name_;
}
private:
rule<Iter, boost::tuple<std::vector<std::string>, std::string>()> start;
module_path<Iter> module_path_;
value_name<Iter> value_name_;
};
Notes I feel it might be a design smell to use separate sub-grammars for such small items. Although grammar decomposition is frequently a good idea to keep build times manageable and code size somewhat lower, but it seems - from the description here - you might be overdoing things.
The "plastering" of parser expressions behind a qi::rule
(effectively type erasure) comes with a possibly significant runtime overhead. If you subsequently instantiate those for more than a single iterator type, you may be compounding this with unnecessary growth of the binary.
UPDATE Regarding the idiomatic way to compose your grammars in Spirit, here's my take:
using namespace ascii;
using qi::raw;
lowercase_ident = raw[ (lower | '_') >> *(alnum | '_' | '\'') ];
module_path_item = raw[ upper >> *(alnum | '_' | '\'') ];
module_path_ = module_path_item % '.';
auto special_char = boost::proto::deep_copy(char_("-+!$%&*./:<=>?@^|~"));
operator_name = qi::raw [
('!' >> *special_char) /* branch 1 */
| (char_("~?") >> +special_char) /* branch 2 */
| (!char_(".:") >> special_char >> *special_char) /* branch 3 */
| "mod" /* branch 4 */
| "lor" | "lsl" | "lsr" | "asr" | "or" /* branch 5-9 */
| "-." /* branch 10 doesn't match because of branch 3 */
| "!=" | "||" | "&&" | ":=" /* branch 11-14 doesn't match because of branch 1,3 */
// | (special_char - char_("!$%./:?@^|~")) /* "*+=<>&-" cannot match because of branch 3 */
]
;
value_name_ =
lowercase_ident
| '(' >> operator_name >> ')'
;
start = -(module_path_ >> '.') >> value_name_;
Where the rules are fields declared as:
qi::rule<Iter, ast::value_path(), Skipper> start;
qi::rule<Iter, ast::module_path(), Skipper> module_path_;
// lexeme: (no skipper)
qi::rule<Iter, std::string()> value_name_, module_path_item, lowercase_ident, operator_name;
Notes:
value_path
grammar didn't use one, any skipper you passed into qi::phrase_parse
was being ignoredqi::lexeme[]
qi::raw
. This allows us to write grammars more succinctly (using '!'
instead of char_('!')
, "mod"
instead of qi::string("mod")
). Note that bare literals are implicitly transformed into "non-capturing" qi::lit(...)
nodes in the context of a Qi parser expression, but since we used raw[]
anyways, the fact that lit
doesn't capture an attribute is not a problem.I think this results in a perfectly cromulent grammar definition that should satisfy your criteria for "high-level". There's some wtf-y-ness with the grammar itself (regardless of its expression any parser generator language, likely):
operator_name
rule by removing nesting of alternative branches that will result in the same effect as the simplified flat alternative listspecial_chars
In alternative branch 3, e.g., I've noted the exceptions with a negative assertion:
(!char_(".:") >> special_char >> *special_char) /* branch 3 */
The !char_(".:")
assertion says: when the input wouldn't match '.'
or ':'
continue matching (any sequence of special characters). In fact you could equivalently write this as:
((special_char - '.' - ':') >> *special_char) /* branch 3 */
or even, as I ended up writing it:
(!char_(".:") >> +special_char) /* branch 3 */
The simplification of the branches actually raises the level of abstraction! It becomes clear now, that some of the branches will never match, because earlier branches match the input by definition:
| "-." /* branch 10 doesn't match because of branch 3 */
| "!=" | "||" | "&&" | ":=" /* branch 11-14 doesn't match because of branch 1,3 */
// | (special_char - char_("!$%./:?@^|~")) /* "*+=<>&-" cannot match because of branch 3 */
I hope you can see why I qualify this part of the grammar as "a little bit wtf-y" :) I'll assume for now that you got confused or something went wrong when you reduces it to a single rules (your "fool's errand").
Some further improvements to be noted:
boost::tuple<>
to make the code more legibleusing namespace
. This is generally a bad idea. And with Spirit it is frequently a very bad idea (it can lead to ambiguities that are unsolvable, or to very hard to spot errors). As you can see, it doesn't necessarily lead to very verbose code.#define BOOST_SPIRIT_DEBUG
#include <boost/spirit/include/qi.hpp>
#include <boost/fusion/adapted.hpp>
namespace qi = boost::spirit::qi;
namespace ascii = boost::spirit::ascii;
namespace ast {
using module_path = std::vector<std::string>;
struct value_path {
module_path module;
std::string value_expr;
};
}
BOOST_FUSION_ADAPT_STRUCT(ast::value_path, (ast::module_path, module)(std::string,value_expr))
template <typename Iter, typename Skipper = ascii::space_type>
struct value_path : qi::grammar<Iter, ast::value_path(), Skipper> {
value_path() : value_path::base_type(start)
{
using namespace ascii;
using qi::raw;
lowercase_ident = raw[ (lower | '_') >> *(alnum | '_' | '\'') ];
module_path_item = raw[ upper >> *(alnum | '_' | '\'') ];
module_path_ = module_path_item % '.';
auto special_char = boost::proto::deep_copy(char_("-+!$%&*./:<=>?@^|~"));
operator_name = qi::raw [
('!' >> *special_char) /* branch 1 */
| (char_("~?") >> +special_char) /* branch 2 */
| (!char_(".:") >> +special_char) /* branch 3 */
| "mod" /* branch 4 */
| "lor" | "lsl" | "lsr" | "asr" | "or" /* branch 5-9 */
| "-." /* branch 10 doesn't match because of branch 3 */
| "!=" | "||" | "&&" | ":=" /* branch 11-14 doesn't match because of branch 1,3 */
// | (special_char - char_("!$%./:?@^|~")) /* "*+=<>&-" cannot match because of branch 3 */
]
;
value_name_ =
lowercase_ident
| '(' >> operator_name >> ')'
;
start = -(module_path_ >> '.') >> value_name_;
BOOST_SPIRIT_DEBUG_NODES((start)(module_path_)(value_name_)(module_path_item)(lowercase_ident)(operator_name))
}
private:
qi::rule<Iter, ast::value_path(), Skipper> start;
qi::rule<Iter, ast::module_path(), Skipper> module_path_;
// lexeme: (no skipper)
qi::rule<Iter, std::string()> value_name_, module_path_item, lowercase_ident, operator_name;
};
int main()
{
for (std::string const input : {
"Some.Module.Package.ident",
"ident",
"A.B.C_.mod", // as lowercase_ident
"A.B.C_.(mod)", // as operator_name (branch 4)
"A.B.C_.(!=)", // as operator_name (branch 1)
"(!)" // as operator_name (branch 1)
})
{
std::cout << "--------------------------------------------------------------\n";
std::cout << "Parsing '" << input << "'\n";
using It = std::string::const_iterator;
It f(input.begin()), l(input.end());
value_path<It> g;
ast::value_path data;
bool ok = qi::phrase_parse(f, l, g, ascii::space, data);
if (ok) {
std::cout << "Parse succeeded\n";
} else {
std::cout << "Parse failed\n";
}
if (f!=l)
std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
}
}
--------------------------------------------------------------
Parsing 'Some.Module.Package.ident'
<start>
<try>Some.Module.Package.</try>
<module_path_>
<try>Some.Module.Package.</try>
<module_path_item>
<try>Some.Module.Package.</try>
<success>.Module.Package.iden</success>
<attributes>[[S, o, m, e]]</attributes>
</module_path_item>
<module_path_item>
<try>Module.Package.ident</try>
<success>.Package.ident</success>
<attributes>[[M, o, d, u, l, e]]</attributes>
</module_path_item>
<module_path_item>
<try>Package.ident</try>
<success>.ident</success>
<attributes>[[P, a, c, k, a, g, e]]</attributes>
</module_path_item>
<module_path_item>
<try>ident</try>
<fail/>
</module_path_item>
<success>.ident</success>
<attributes>[[[S, o, m, e], [M, o, d, u, l, e], [P, a, c, k, a, g, e]]]</attributes>
</module_path_>
<value_name_>
<try>ident</try>
<lowercase_ident>
<try>ident</try>
<success></success>
<attributes>[[i, d, e, n, t]]</attributes>
</lowercase_ident>
<success></success>
<attributes>[[i, d, e, n, t]]</attributes>
</value_name_>
<success></success>
<attributes>[[[[S, o, m, e], [M, o, d, u, l, e], [P, a, c, k, a, g, e]], [i, d, e, n, t]]]</attributes>
</start>
Parse succeeded
--------------------------------------------------------------
Parsing 'ident'
<start>
<try>ident</try>
<module_path_>
<try>ident</try>
<module_path_item>
<try>ident</try>
<fail/>
</module_path_item>
<fail/>
</module_path_>
<value_name_>
<try>ident</try>
<lowercase_ident>
<try>ident</try>
<success></success>
<attributes>[[i, d, e, n, t]]</attributes>
</lowercase_ident>
<success></success>
<attributes>[[i, d, e, n, t]]</attributes>
</value_name_>
<success></success>
<attributes>[[[], [i, d, e, n, t]]]</attributes>
</start>
Parse succeeded
--------------------------------------------------------------
Parsing 'A.B.C_.mod'
<start>
<try>A.B.C_.mod</try>
<module_path_>
<try>A.B.C_.mod</try>
<module_path_item>
<try>A.B.C_.mod</try>
<success>.B.C_.mod</success>
<attributes>[[A]]</attributes>
</module_path_item>
<module_path_item>
<try>B.C_.mod</try>
<success>.C_.mod</success>
<attributes>[[B]]</attributes>
</module_path_item>
<module_path_item>
<try>C_.mod</try>
<success>.mod</success>
<attributes>[[C, _]]</attributes>
</module_path_item>
<module_path_item>
<try>mod</try>
<fail/>
</module_path_item>
<success>.mod</success>
<attributes>[[[A], [B], [C, _]]]</attributes>
</module_path_>
<value_name_>
<try>mod</try>
<lowercase_ident>
<try>mod</try>
<success></success>
<attributes>[[m, o, d]]</attributes>
</lowercase_ident>
<success></success>
<attributes>[[m, o, d]]</attributes>
</value_name_>
<success></success>
<attributes>[[[[A], [B], [C, _]], [m, o, d]]]</attributes>
</start>
Parse succeeded
--------------------------------------------------------------
Parsing 'A.B.C_.(mod)'
<start>
<try>A.B.C_.(mod)</try>
<module_path_>
<try>A.B.C_.(mod)</try>
<module_path_item>
<try>A.B.C_.(mod)</try>
<success>.B.C_.(mod)</success>
<attributes>[[A]]</attributes>
</module_path_item>
<module_path_item>
<try>B.C_.(mod)</try>
<success>.C_.(mod)</success>
<attributes>[[B]]</attributes>
</module_path_item>
<module_path_item>
<try>C_.(mod)</try>
<success>.(mod)</success>
<attributes>[[C, _]]</attributes>
</module_path_item>
<module_path_item>
<try>(mod)</try>
<fail/>
</module_path_item>
<success>.(mod)</success>
<attributes>[[[A], [B], [C, _]]]</attributes>
</module_path_>
<value_name_>
<try>(mod)</try>
<lowercase_ident>
<try>(mod)</try>
<fail/>
</lowercase_ident>
<operator_name>
<try>mod)</try>
<success>)</success>
<attributes>[[m, o, d]]</attributes>
</operator_name>
<success></success>
<attributes>[[m, o, d]]</attributes>
</value_name_>
<success></success>
<attributes>[[[[A], [B], [C, _]], [m, o, d]]]</attributes>
</start>
Parse succeeded
--------------------------------------------------------------
Parsing 'A.B.C_.(!=)'
<start>
<try>A.B.C_.(!=)</try>
<module_path_>
<try>A.B.C_.(!=)</try>
<module_path_item>
<try>A.B.C_.(!=)</try>
<success>.B.C_.(!=)</success>
<attributes>[[A]]</attributes>
</module_path_item>
<module_path_item>
<try>B.C_.(!=)</try>
<success>.C_.(!=)</success>
<attributes>[[B]]</attributes>
</module_path_item>
<module_path_item>
<try>C_.(!=)</try>
<success>.(!=)</success>
<attributes>[[C, _]]</attributes>
</module_path_item>
<module_path_item>
<try>(!=)</try>
<fail/>
</module_path_item>
<success>.(!=)</success>
<attributes>[[[A], [B], [C, _]]]</attributes>
</module_path_>
<value_name_>
<try>(!=)</try>
<lowercase_ident>
<try>(!=)</try>
<fail/>
</lowercase_ident>
<operator_name>
<try>!=)</try>
<success>)</success>
<attributes>[[!, =]]</attributes>
</operator_name>
<success></success>
<attributes>[[!, =]]</attributes>
</value_name_>
<success></success>
<attributes>[[[[A], [B], [C, _]], [!, =]]]</attributes>
</start>
Parse succeeded
--------------------------------------------------------------
Parsing '(!)'
<start>
<try>(!)</try>
<module_path_>
<try>(!)</try>
<module_path_item>
<try>(!)</try>
<fail/>
</module_path_item>
<fail/>
</module_path_>
<value_name_>
<try>(!)</try>
<lowercase_ident>
<try>(!)</try>
<fail/>
</lowercase_ident>
<operator_name>
<try>!)</try>
<success>)</success>
<attributes>[[!]]</attributes>
</operator_name>
<success></success>
<attributes>[[!]]</attributes>
</value_name_>
<success></success>
<attributes>[[[], [!]]]</attributes>
</start>
Parse succeeded