Search code examples
c++parsingboostboost-spirit

Internal Boost::Spirit code segfaults when parsing a composite grammar


I'm trying to use Spirit to parse expressions of the form Module1.Module2.value (any number of dot-separated capitalized identifiers, then a dot, then a lowercase OCaml-style identifier). My current definition of the parser looks like this:

using namespace boost::spirit::qi;

template <typename Iter=std::string::iterator>
struct value_path : grammar<Iter, boost::tuple<std::vector<std::string>, std::string>()> {
    value_path() :
        value_path::base_type(start)
    {
        start = -(module_path<Iter>() >> '.') >> value_name<Iter>();
    }
    rule<Iter, boost::tuple<std::vector<std::string>, std::string>()> start;
};

where module_path and value_name are similar template structs inherting from qi::grammar with single start field that is assigned some Spirit rule, possibly using other custom grammars (e.g. value_name depends on lowercase_ident and operator_name which are defined analogously) in the constructor.

When attempting to parse_phrase() with this grammar, the program segfaults somewhere in the internals of Spirit (according to gdb). The equivalent definition, where the constructor of value_path is as follows (I've basically unrolled all custom grammars it depends on, leaving only builtin Spirit parsers, and attempted to make it readable, which in hindsight was a fool's errand):

start =
-((raw[upper >> *(alnum | char_('_') | char_('\''))] % '.') >> '.')
>> lexeme[((lower | char_('_')) >> *(alnum | char_('_') | char_('\'')))
         | char_('(') >>
             ( ( (char_('!') >> *char_("-+!$%&*./:<=>?@^|~")
                 | (char_("~?") >> +char_("-+!$%&*./:<=>?@^|~"))
                 | ( (char_("-+=<>@^|&*/$%") >> *char_("-+!$%&*./:<=>?@^|~"))
                   | string("mod")
                   | string("lor")
                   | string("lsl")
                   | string("lsr")
                   | string("asr")
                   | string("or")
                   | string("-.")
                   | string("!=")
                   | string("||")
                   | string("&&")
                   | string(":=")
                   | char_("*+=<>&-")
                   )
                 ) >> char_(')')
               )
             )
         ];

does not segfault, and appears to work correctly, however I would rather avoid something this verbose and unreadable in my code. It's also not extensible at all.

So far I've tried various combinations of .alias(), as well as saving value_name<Iter>(), module_path<Iter>() and all intermediate grammars along the dependency chain into their own fields. Neither of those worked. How can I keep the high level of abstraction of the first example? Is there a standard way of composing grammars in Spirit that does not run into issues?


Solution

  • You're running into trouble because expression templates keep internal references to temporaries.

    Simply aggregate the sub-parser instances:

    template <typename Iter=std::string::iterator>
    struct value_path : grammar<Iter, boost::tuple<std::vector<std::string>, std::string>()> {
        value_path() : value_path::base_type(start)
        {
            start = -(module_path_ >> '.') >> value_name_;
        }
      private:
    
        rule<Iter, boost::tuple<std::vector<std::string>, std::string>()> start;
        module_path<Iter> module_path_;
        value_name<Iter> value_name_;
    };
    

    Notes I feel it might be a design smell to use separate sub-grammars for such small items. Although grammar decomposition is frequently a good idea to keep build times manageable and code size somewhat lower, but it seems - from the description here - you might be overdoing things.

    The "plastering" of parser expressions behind a qi::rule (effectively type erasure) comes with a possibly significant runtime overhead. If you subsequently instantiate those for more than a single iterator type, you may be compounding this with unnecessary growth of the binary.

    UPDATE Regarding the idiomatic way to compose your grammars in Spirit, here's my take:

    Live On Coliru

    using namespace ascii;
    using qi::raw;
    
    lowercase_ident  = raw[ (lower | '_') >> *(alnum | '_' | '\'') ];
    module_path_item = raw[ upper >> *(alnum | '_' | '\'') ];
    module_path_     = module_path_item % '.';
    
    auto special_char = boost::proto::deep_copy(char_("-+!$%&*./:<=>?@^|~"));
    
    operator_name = qi::raw [
              ('!' >> *special_char)                          /* branch 1     */
            | (char_("~?") >> +special_char)                  /* branch 2     */
            | (!char_(".:") >> special_char >> *special_char) /* branch 3     */
            | "mod"                                           /* branch 4     */
            | "lor" | "lsl" | "lsr" | "asr" | "or"            /* branch 5-9   */
            | "-."                                            /* branch 10    doesn't match because of branch 3   */
            | "!=" | "||" | "&&" | ":="                       /* branch 11-14 doesn't match because of branch 1,3 */
         // | (special_char - char_("!$%./:?@^|~"))           /* "*+=<>&-" cannot match because of branch 3 */
        ]
        ;
    
    value_name_  = 
          lowercase_ident
        | '(' >> operator_name >> ')'
        ;
    
    start = -(module_path_ >> '.') >> value_name_;
    

    Where the rules are fields declared as:

    qi::rule<Iter, ast::value_path(),  Skipper> start;
    qi::rule<Iter, ast::module_path(), Skipper> module_path_;
    
    // lexeme: (no skipper)
    qi::rule<Iter, std::string()> value_name_, module_path_item, lowercase_ident, operator_name;
    

    Notes:

    • I've added a skipper, because since your value_path grammar didn't use one, any skipper you passed into qi::phrase_parse was being ignored
    • The lexemes just drop the skipper from the rule declaration type, so you don't even need to specify qi::lexeme[]
    • In the lexemes, I copied your intention to just copy the parsed text verbatim using qi::raw. This allows us to write grammars more succinctly (using '!' instead of char_('!'), "mod" instead of qi::string("mod")). Note that bare literals are implicitly transformed into "non-capturing" qi::lit(...) nodes in the context of a Qi parser expression, but since we used raw[] anyways, the fact that lit doesn't capture an attribute is not a problem.

    I think this results in a perfectly cromulent grammar definition that should satisfy your criteria for "high-level". There's some wtf-y-ness with the grammar itself (regardless of its expression any parser generator language, likely):

    • I've simplified the operator_name rule by removing nesting of alternative branches that will result in the same effect as the simplified flat alternative list
    • I've refactored the "magic" lists of special characters into special_chars
    • In alternative branch 3, e.g., I've noted the exceptions with a negative assertion:

      (!char_(".:") >> special_char >> *special_char) /* branch 3     */
      

      The !char_(".:") assertion says: when the input wouldn't match '.' or ':' continue matching (any sequence of special characters). In fact you could equivalently write this as:

      ((special_char - '.' - ':') >> *special_char) /* branch 3     */
      

      or even, as I ended up writing it:

      (!char_(".:") >> +special_char) /* branch 3     */
      
    • The simplification of the branches actually raises the level of abstraction! It becomes clear now, that some of the branches will never match, because earlier branches match the input by definition:

         | "-."                                    /* branch 10    doesn't match because of branch 3   */
         | "!=" | "||" | "&&" | ":="               /* branch 11-14 doesn't match because of branch 1,3 */
      // | (special_char - char_("!$%./:?@^|~"))   /* "*+=<>&-" cannot match because of branch 3 */
      

    I hope you can see why I qualify this part of the grammar as "a little bit wtf-y" :) I'll assume for now that you got confused or something went wrong when you reduces it to a single rules (your "fool's errand").


    Some further improvements to be noted:

    • I've added a proper AST struct instead of the boost::tuple<> to make the code more legible
    • I've added BOOST_SPIRIT_DEBUG* macros so you can debug your grammar at a high level (the rule level)
    • I've ditched the blanket using namespace. This is generally a bad idea. And with Spirit it is frequently a very bad idea (it can lead to ambiguities that are unsolvable, or to very hard to spot errors). As you can see, it doesn't necessarily lead to very verbose code.

    Full Listing

    #define BOOST_SPIRIT_DEBUG
    #include <boost/spirit/include/qi.hpp>
    #include <boost/fusion/adapted.hpp>
    
    namespace qi    = boost::spirit::qi;
    namespace ascii = boost::spirit::ascii;
    
    namespace ast {
        using module_path = std::vector<std::string>;
        struct value_path {
            module_path module;
            std::string   value_expr;
        };
    }
    
    BOOST_FUSION_ADAPT_STRUCT(ast::value_path, (ast::module_path, module)(std::string,value_expr))
    
    template <typename Iter, typename Skipper = ascii::space_type>
    struct value_path : qi::grammar<Iter, ast::value_path(), Skipper> {
        value_path() : value_path::base_type(start)
        {
            using namespace ascii;
            using qi::raw;
    
            lowercase_ident  = raw[ (lower | '_') >> *(alnum | '_' | '\'') ];
            module_path_item = raw[ upper >> *(alnum | '_' | '\'') ];
            module_path_     = module_path_item % '.';
    
            auto special_char = boost::proto::deep_copy(char_("-+!$%&*./:<=>?@^|~"));
    
            operator_name = qi::raw [
                      ('!'          >> *special_char)         /* branch 1     */
                    | (char_("~?")  >> +special_char)         /* branch 2     */
                    | (!char_(".:") >> +special_char)         /* branch 3     */
                    | "mod"                                   /* branch 4     */
                    | "lor" | "lsl" | "lsr" | "asr" | "or"    /* branch 5-9   */
                    | "-."                                    /* branch 10    doesn't match because of branch 3   */
                    | "!=" | "||" | "&&" | ":="               /* branch 11-14 doesn't match because of branch 1,3 */
                 // | (special_char - char_("!$%./:?@^|~"))   /* "*+=<>&-" cannot match because of branch 3 */
                ]
                ;
    
            value_name_  = 
                  lowercase_ident
                | '(' >> operator_name >> ')'
                ;
    
            start = -(module_path_ >> '.') >> value_name_;
    
            BOOST_SPIRIT_DEBUG_NODES((start)(module_path_)(value_name_)(module_path_item)(lowercase_ident)(operator_name))
        }
      private:
        qi::rule<Iter, ast::value_path(),  Skipper> start;
        qi::rule<Iter, ast::module_path(), Skipper> module_path_;
    
        // lexeme: (no skipper)
        qi::rule<Iter, std::string()> value_name_, module_path_item, lowercase_ident, operator_name;
    };
    
    int main()
    {
        for (std::string const input : { 
                "Some.Module.Package.ident",
                "ident",
                "A.B.C_.mod",    // as lowercase_ident
                "A.B.C_.(mod)",  // as operator_name (branch 4)
                "A.B.C_.(!=)",   // as operator_name (branch 1)
                "(!)"            // as operator_name (branch 1)
                })
        {
            std::cout << "--------------------------------------------------------------\n";
            std::cout << "Parsing '" << input << "'\n";
    
            using It = std::string::const_iterator;
            It f(input.begin()), l(input.end());
    
            value_path<It> g;
            ast::value_path data;
            bool ok = qi::phrase_parse(f, l, g, ascii::space, data);
            if (ok) {
                std::cout << "Parse succeeded\n";
            } else {
                std::cout << "Parse failed\n";
            }
    
            if (f!=l)
                std::cout << "Remaining unparsed: '" << std::string(f,l) << "'\n";
        }
    }
    

    Debug Output

    --------------------------------------------------------------
    Parsing 'Some.Module.Package.ident'
    <start>
      <try>Some.Module.Package.</try>
      <module_path_>
        <try>Some.Module.Package.</try>
        <module_path_item>
          <try>Some.Module.Package.</try>
          <success>.Module.Package.iden</success>
          <attributes>[[S, o, m, e]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>Module.Package.ident</try>
          <success>.Package.ident</success>
          <attributes>[[M, o, d, u, l, e]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>Package.ident</try>
          <success>.ident</success>
          <attributes>[[P, a, c, k, a, g, e]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>ident</try>
          <fail/>
        </module_path_item>
        <success>.ident</success>
        <attributes>[[[S, o, m, e], [M, o, d, u, l, e], [P, a, c, k, a, g, e]]]</attributes>
      </module_path_>
      <value_name_>
        <try>ident</try>
        <lowercase_ident>
          <try>ident</try>
          <success></success>
          <attributes>[[i, d, e, n, t]]</attributes>
        </lowercase_ident>
        <success></success>
        <attributes>[[i, d, e, n, t]]</attributes>
      </value_name_>
      <success></success>
      <attributes>[[[[S, o, m, e], [M, o, d, u, l, e], [P, a, c, k, a, g, e]], [i, d, e, n, t]]]</attributes>
    </start>
    Parse succeeded
    --------------------------------------------------------------
    Parsing 'ident'
    <start>
      <try>ident</try>
      <module_path_>
        <try>ident</try>
        <module_path_item>
          <try>ident</try>
          <fail/>
        </module_path_item>
        <fail/>
      </module_path_>
      <value_name_>
        <try>ident</try>
        <lowercase_ident>
          <try>ident</try>
          <success></success>
          <attributes>[[i, d, e, n, t]]</attributes>
        </lowercase_ident>
        <success></success>
        <attributes>[[i, d, e, n, t]]</attributes>
      </value_name_>
      <success></success>
      <attributes>[[[], [i, d, e, n, t]]]</attributes>
    </start>
    Parse succeeded
    --------------------------------------------------------------
    Parsing 'A.B.C_.mod'
    <start>
      <try>A.B.C_.mod</try>
      <module_path_>
        <try>A.B.C_.mod</try>
        <module_path_item>
          <try>A.B.C_.mod</try>
          <success>.B.C_.mod</success>
          <attributes>[[A]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>B.C_.mod</try>
          <success>.C_.mod</success>
          <attributes>[[B]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>C_.mod</try>
          <success>.mod</success>
          <attributes>[[C, _]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>mod</try>
          <fail/>
        </module_path_item>
        <success>.mod</success>
        <attributes>[[[A], [B], [C, _]]]</attributes>
      </module_path_>
      <value_name_>
        <try>mod</try>
        <lowercase_ident>
          <try>mod</try>
          <success></success>
          <attributes>[[m, o, d]]</attributes>
        </lowercase_ident>
        <success></success>
        <attributes>[[m, o, d]]</attributes>
      </value_name_>
      <success></success>
      <attributes>[[[[A], [B], [C, _]], [m, o, d]]]</attributes>
    </start>
    Parse succeeded
    --------------------------------------------------------------
    Parsing 'A.B.C_.(mod)'
    <start>
      <try>A.B.C_.(mod)</try>
      <module_path_>
        <try>A.B.C_.(mod)</try>
        <module_path_item>
          <try>A.B.C_.(mod)</try>
          <success>.B.C_.(mod)</success>
          <attributes>[[A]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>B.C_.(mod)</try>
          <success>.C_.(mod)</success>
          <attributes>[[B]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>C_.(mod)</try>
          <success>.(mod)</success>
          <attributes>[[C, _]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>(mod)</try>
          <fail/>
        </module_path_item>
        <success>.(mod)</success>
        <attributes>[[[A], [B], [C, _]]]</attributes>
      </module_path_>
      <value_name_>
        <try>(mod)</try>
        <lowercase_ident>
          <try>(mod)</try>
          <fail/>
        </lowercase_ident>
        <operator_name>
          <try>mod)</try>
          <success>)</success>
          <attributes>[[m, o, d]]</attributes>
        </operator_name>
        <success></success>
        <attributes>[[m, o, d]]</attributes>
      </value_name_>
      <success></success>
      <attributes>[[[[A], [B], [C, _]], [m, o, d]]]</attributes>
    </start>
    Parse succeeded
    --------------------------------------------------------------
    Parsing 'A.B.C_.(!=)'
    <start>
      <try>A.B.C_.(!=)</try>
      <module_path_>
        <try>A.B.C_.(!=)</try>
        <module_path_item>
          <try>A.B.C_.(!=)</try>
          <success>.B.C_.(!=)</success>
          <attributes>[[A]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>B.C_.(!=)</try>
          <success>.C_.(!=)</success>
          <attributes>[[B]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>C_.(!=)</try>
          <success>.(!=)</success>
          <attributes>[[C, _]]</attributes>
        </module_path_item>
        <module_path_item>
          <try>(!=)</try>
          <fail/>
        </module_path_item>
        <success>.(!=)</success>
        <attributes>[[[A], [B], [C, _]]]</attributes>
      </module_path_>
      <value_name_>
        <try>(!=)</try>
        <lowercase_ident>
          <try>(!=)</try>
          <fail/>
        </lowercase_ident>
        <operator_name>
          <try>!=)</try>
          <success>)</success>
          <attributes>[[!, =]]</attributes>
        </operator_name>
        <success></success>
        <attributes>[[!, =]]</attributes>
      </value_name_>
      <success></success>
      <attributes>[[[[A], [B], [C, _]], [!, =]]]</attributes>
    </start>
    Parse succeeded
    --------------------------------------------------------------
    Parsing '(!)'
    <start>
      <try>(!)</try>
      <module_path_>
        <try>(!)</try>
        <module_path_item>
          <try>(!)</try>
          <fail/>
        </module_path_item>
        <fail/>
      </module_path_>
      <value_name_>
        <try>(!)</try>
        <lowercase_ident>
          <try>(!)</try>
          <fail/>
        </lowercase_ident>
        <operator_name>
          <try>!)</try>
          <success>)</success>
          <attributes>[[!]]</attributes>
        </operator_name>
        <success></success>
        <attributes>[[!]]</attributes>
      </value_name_>
      <success></success>
      <attributes>[[[], [!]]]</attributes>
    </start>
    Parse succeeded