Search code examples
c++parsingboostboost-spiritboost-spirit-qi

boost::spirit::qi Expectation Parser and parser grouping unexpected behaviour


I'm hoping someone can shine a light through my ignorance of using the > and >> operators in spirit parsing.

I have a working grammar, where the top-level rule looks like

test = identifier >> operationRule >> repeat(1,3)[any_string] >> arrow >> any_string >> conditionRule;

It relies on attributes to automatically allocated parsed values to a fusion-adapted struct (ie a boost tuple).

However, I know that once we match the operationRule, we must continue or fail (ie we don't want to allow backtracking to try other rules that begin with identifier).

test = identifier >> 
           operationRule > repeat(1,3)[any_string] > arrow > any_string > conditionRule;

This causes a cryptic compiler error ('boost::Container' : use of class template requires template argument list). Futz around a bit and the following compiles:

test = identifier >> 
           (operationRule > repeat(1,3)[any_string] > arrow > any_string > conditionRule);

but the attribute setting no longer works - my data structure contains garbage after parsing. This can be fixed by adding actions like [at_c<0>(_val) = _1], but that seems a little clunky - as well as making things slower according to the boost docs.

So, my questions are

  1. Is it worth preventing back-tracking?
  2. Why do I need the grouping operator ()
  3. Does my last example above really stop back-tracking after operationRule is matched (I suspect not, it seems that if the entire parser inside the (...) fails backtracking will be allowed)?
  4. If the answer to the previous question is /no/, how do I construct a rule that allows backtracking if operation is /not/ matched, but does not allow backtracking once operation /is/ matched?
  5. Why does the grouping operator destroy the attribute grammar - requiring actions?

I realise this is quite a broad question - any hints that point in the right direction will be greatly appreciated!


Solution

    1. Is it worth preventing back-tracking?

      Absolutely. Preventing back tracking in general is a proven way to improve parser performance.

      • reduce the use of (negative) lookahead (operator !, operator - and some operator &)
      • order branches (operator |, operator ||, operator^ and some operator */-/+) such that the most frequent/likely branch is ordered first, or that the most costly branch is tried last

      Using expectation points (>) does not essentially reduce backtracking: it will just disallow it. This will enable targeted error messages, prevent useless 'parsing-into-the-unknown'.

    2. Why do I need the grouping operator ()

      I'm not sure. I had a check using my what_is_the_attr helpers from here

      • ident >> op >> repeat(1,3)[any] >> "->" >> any
        synthesizes attribute:

        fusion::vector4<string, string, vector<string>, string>
        
      • ident >> op > repeat(1,3)[any] > "->" > any
        synthesizes attribute:

        fusion::vector3<fusion::vector2<string, string>, vector<string>, string>
        

      I haven't found the need to group subexpressions using parentheses (things compile), but obviously DataT needs to be modified to match the changed layout.

      typedef boost::tuple<
          boost::tuple<std::string, std::string>, 
          std::vector<std::string>, 
          std::string
      > DataT;
      

    The full code below shows how I'd prefer to do that, using adapted structs.

    1. Does my above example really stop back-tracking after operationRule is matched (I suspect not, it seems that if the entire parser inside the (...) fails backtracking will be allowed)?

      Absolutely. If the expectation(s) is not met, a qi::expectation_failure<> exception is thrown. This by default aborts the parse. You could use qi::on_error to retry, fail, accept or rethrow. The MiniXML example has very good examples on using expectation points with qi::on_error

    2. If the answer to the previous question is /no/, how do I construct a rule that allows backtracking if operation is /not/ matched, but does not allow backtracking once operation /is/ matched?

    3. Why does the grouping operator destroy the attribute grammar - requiring actions?

      It doesn't destroy the attribute grammar, it just changes the exposed type. So, if you bind an appropriate attribute reference to the rule/grammar, you won't need semantic actions. Now, I feel there should be ways to go without the grouping , so let me try it (preferrably on your short selfcontained sample). And indeed I have found no such need. I've added a full example to help you see what is happening in my testing, and not using semantic actions.

    Full Code

    The full code show 5 scenarios:

    • OPTION 1: Original without expectations

      (no relevant changes)

    • OPTION 2: with expectations

      Using the modified typedef for DataT (as shown above)

    • OPTION 3: adapted struct, without expectations

      Using a userdefined struct with BOOST_FUSION_ADAPT_STRUCT

    • OPTION 4: adapted struct, with expectations

      Modifying the adapted struct from OPTION 3

    • OPTION 5: lookahead hack

      This one leverages a 'clever' (?) hack, by making all >> into expectations, and detecting the presence of a operationRule-match beforehand. This is of course suboptimal, but allows you to keep DataT unmodified, and without using semantic actions.

    Obviously, define OPTION to the desired value before compiling.

    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/karma.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <boost/fusion/adapted.hpp>
    #include <iostream>
    
    namespace qi    = boost::spirit::qi; 
    namespace karma = boost::spirit::karma; 
    
    #ifndef OPTION
    #define OPTION 5
    #endif
    
    #if OPTION == 1 || OPTION == 5 // original without expectations (OR lookahead hack)
        typedef boost::tuple<std::string, std::string, std::vector<std::string>, std::string> DataT;
    #elif OPTION == 2 // with expectations
        typedef boost::tuple<boost::tuple<std::string, std::string>, std::vector<std::string>, std::string> DataT;
    #elif OPTION == 3 // adapted struct, without expectations
        struct DataT
        {
            std::string identifier, operation;
            std::vector<std::string> values;
            std::string destination;
        };
    
        BOOST_FUSION_ADAPT_STRUCT(DataT, (std::string, identifier)(std::string, operation)(std::vector<std::string>, values)(std::string, destination));
    #elif OPTION == 4 // adapted struct, with expectations
        struct IdOpT
        {
            std::string identifier, operation;
        };
        struct DataT
        {
            IdOpT idop;
            std::vector<std::string> values;
            std::string destination;
        };
    
        BOOST_FUSION_ADAPT_STRUCT(IdOpT, (std::string, identifier)(std::string, operation));
        BOOST_FUSION_ADAPT_STRUCT(DataT, (IdOpT, idop)(std::vector<std::string>, values)(std::string, destination));
    #endif
    
    template <typename Iterator>
    struct test_parser : qi::grammar<Iterator, DataT(), qi::space_type, qi::locals<char> >
    {
        test_parser() : test_parser::base_type(test, "test")
        {
            using namespace qi;
    
            quoted_string = 
                   omit    [ char_("'\"") [_a =_1] ]             
                >> no_skip [ *(char_ - char_(_a))  ]
                 > lit(_a); 
    
            any_string = quoted_string | +qi::alnum;
    
            identifier = lexeme [ alnum >> *graph ];
    
            operationRule = string("add") | "sub";
            arrow = "->";
    
    #if OPTION == 1 || OPTION == 3   // without expectations
            test = identifier >> operationRule >> repeat(1,3)[any_string] >> arrow >> any_string;
    #elif OPTION == 2 || OPTION == 4 // with expectations
            test = identifier >> operationRule  > repeat(1,3)[any_string]  > arrow  > any_string;
    #elif OPTION == 5                // lookahead hack
            test = &(identifier >> operationRule) > identifier > operationRule > repeat(1,3)[any_string] > arrow > any_string;
    #endif
        }
    
        qi::rule<Iterator, qi::space_type/*, qi::locals<char> */> arrow;
        qi::rule<Iterator, std::string(), qi::space_type/*, qi::locals<char> */> operationRule;
        qi::rule<Iterator, std::string(), qi::space_type/*, qi::locals<char> */> identifier;
        qi::rule<Iterator, std::string(), qi::space_type, qi::locals<char> > quoted_string, any_string;
        qi::rule<Iterator, DataT(),       qi::space_type, qi::locals<char> > test;
    };
    
    int main()
    {
        std::string str("addx001 add 'str1'   \"str2\"       ->  \"str3\"");
        test_parser<std::string::const_iterator> grammar;
        std::string::const_iterator iter = str.begin();
        std::string::const_iterator end  = str.end();
    
        DataT data;
        bool r = phrase_parse(iter, end, grammar, qi::space, data);
    
        if (r)
        {
            using namespace karma;
            std::cout << "OPTION " << OPTION << ": " << str << " --> ";
    #if OPTION == 1 || OPTION == 3 || OPTION == 5 // without expectations (OR lookahead hack)
            std::cout << format(delimit[auto_ << auto_ << '[' << auto_ << ']' << " --> " << auto_], data) << "\n";
    #elif OPTION == 2 || OPTION == 4 // with expectations
            std::cout << format(delimit[auto_ << '[' << auto_ << ']' << " --> " << auto_], data) << "\n";
    #endif
        }
        if (iter!=end)
            std::cout << "Remaining: " << std::string(iter,end) << "\n";
    }
    

    Output for all OPTIONS:

    for a in 1 2 3 4 5; do g++ -DOPTION=$a -I ~/custom/boost/ test.cpp -o test$a && ./test$a; done
    OPTION 1: addx001 add 'str1'   "str2"       ->  "str3" --> addx001 add [ str1 str2 ]  -->  str3 
    OPTION 2: addx001 add 'str1'   "str2"       ->  "str3" --> addx001 add [ str1 str2 ]  -->  str3 
    OPTION 3: addx001 add 'str1'   "str2"       ->  "str3" --> addx001 add [ str1 str2 ]  -->  str3 
    OPTION 4: addx001 add 'str1'   "str2"       ->  "str3" --> addx001 add [ str1 str2 ]  -->  str3 
    OPTION 5: addx001 add 'str1'   "str2"       ->  "str3" --> addx001 add [ str1 str2 ]  -->  str3