c++ boost-spirit-qi

How to keep space characters when breaking down input into a sequence of different parts using the Alternative parser?


I want to write a simple C++ parser which extracts the block hierarchy. I am using this rule:

std::string rest_content;
std::vector<boost::tuple<std::string, std::string>> scopes;
qi::rule<It, qi::ascii::space_type> block = *(
            r.comment
        |   r.scope[push_back(boost::phoenix::ref(scopes), _1)]
        |   qi::char_[boost::phoenix::ref(rest_content) += _1] // rest
    );

qi::phrase_parse(first, last,
        block,
        ascii::space);

which is supposed to break down code into three parts: comment, scope (code surrounded by "{}") and "rest". The problem is that all the space characters are removed from the "rest". I need those spaces for later parsing (such as extracting identifiers).

I have tried using qi::skip, qi::lexeme and qi::raw to keep spaces:

// one of many failed attempts
qi::rule<It, qi::ascii::space_type> block = qi::lexeme[*(
            qi::skip[r.comment]
        |   qi::skip[r.scope[push_back(boost::phoenix::ref(scopes), _1)]]
        |   qi::char_[push_back(boost::phoenix::ref(rest_content), _1)]
    )];

but it never works.

So how to keep space characters? Any help is welcome. Thanks.


Solution

  • If you're parsing C++ code this way you may be biting off more than you can chew.

    I'll answer, but the answer should show you how limited this approach is going to be. Just imagine parsing through

    namespace q::x {
        namespace y {
            struct A {
                template <typename = ns1::c<int>, typename...> struct C;
            };
    
            template <typename T, typename... Ts>
            struct A::C final : ns2::ns3::base<A::C<T, Ts...>, Ts...> {
                 int simple = [](...) {
                      enum class X : unsigned { answer = 42, };
                      struct {
                          auto operator()(...) -> decltype(auto) {
                               return static_cast<int>(X::answer);
                          } 
                      } iife;
                      return iife();
                 }("/* }}} */"); // {{{
            };
        }
    }
    

    and getting it right. And yes, that's valid code.

    In fact, it's so tricky that it's easy to trip up even "grown" compilers (GCC): https://wandbox.org/permlink/FzcaSl6tbn18jq4f (Clang has no issue: https://wandbox.org/permlink/wu0mFwQiTOogKB5L).

    That said, let me refer to my old explanation of how rule declarations and skippers work together: Boost spirit skipper issues

    And show an approximation of what I'd do.

    The comments

    Actually, the comments should be part of your skipper, so let's make it so:

    using SkipRule = qi::rule<It>;
    SkipRule comment_only 
        = "//" >> *~qi::char_("\r\n") >> qi::eol
        | "/*" >> *(qi::char_ - "*/") >> "*/"
        ;
    

    Now for general skipping, we want to include whitespace:

    SkipRule comment_or_ws
        = qi::space | comment_only;
    

    Now we want to parse types and identifiers:

    qi::rule<It, std::string()> type
        = ( qi::string("struct") 
          | qi::string("class") 
          | qi::string("union") 
          | qi::string("enum") >> -(*comment_or_ws >> qi::string("class"))
          | qi::string("namespace") 
          )
        >> !qi::graph // must be followed by whitespace
        ;
    
    qi::rule<It, std::string()> identifier = 
        qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9")
        ;
    

    I've /guessed/ that struct X { }; would be an example of a "scope" for you, and the tuple would contain ("struct", "X").

    As a bonus, I used attribute adaptation of std::pair, and later on I show how to insert into a multimap for good measure:

    qi::rule<It, std::pair<std::string, std::string>()> scope
        = qi::skip(comment_or_ws.alias()) [
            type >> identifier
            >> *~qi::char_(";{") // ignore some stuff like base classes
            >> qi::omit["{" >> *~qi::char_("}") >> "}" | ';']
        ];
    

    Note that a big shortcoming here is that the first non-commented '}' will "end" the scope. That's not how the language works (see the leading example).
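
    To illustrate what proper scope matching would at least have to do, here is a rough sketch in plain C++ (no Spirit). It is not the answer's method. The function name find_scope_end is hypothetical, and the sketch only handles nested braces, comments, and ordinary string/character literals. Raw string literals and digit separators (1'000) are deliberately not handled, which again shows how quickly this gets complicated.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: given the index of an opening '{', return the index
// one past its matching '}', or std::string::npos if there is no match.
// Braces inside comments and ordinary string/char literals are ignored.
std::size_t find_scope_end(const std::string& src, std::size_t open) {
    if (open >= src.size() || src[open] != '{') return std::string::npos;
    int depth = 0;
    for (std::size_t i = open; i < src.size(); ++i) {
        char c = src[i];
        char next = (i + 1 < src.size()) ? src[i + 1] : '\0';
        if (c == '/' && next == '/') {          // line comment: jump to end of line
            i = src.find('\n', i);
            if (i == std::string::npos) break;
        } else if (c == '/' && next == '*') {   // block comment: jump to closing "*/"
            i = src.find("*/", i + 2);
            if (i == std::string::npos) break;
            ++i;                                // now on the '/' of "*/"
        } else if (c == '"' || c == '\'') {     // string or character literal
            char quote = c;
            for (++i; i < src.size() && src[i] != quote; ++i)
                if (src[i] == '\\') ++i;        // skip the escaped character
        } else if (c == '{') {
            ++depth;
        } else if (c == '}') {
            if (--depth == 0) return i + 1;     // matching close found
        }
    }
    return std::string::npos;                   // unbalanced input
}
```

    A scanner like this could be used to find the real end of a scope before handing the contents to another parser, but it is still only an approximation of the language's lexical rules.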

    Now we can conclude with an improved "block" rule:

    qi::rule<It, SkipRule> block 
        = *(
            scope [px::insert(px::ref(scopes), _1)]
          | qi::skip(comment_only.alias()) [ 
                qi::as_string[qi::raw[+(qi::char_ - scope)]] [px::ref(rest_content) += _1]
          ] // rest
        );
    

    Note that

    • we override the comment_or_ws skipper with comment_only so we don't drop all whitespace from "rest content"
    • inversely, we override the skipper to include whitespace inside the scope rule, because otherwise the negative scope invocation (char_ - scope) would do the wrong thing: it wouldn't skip whitespace

    Full Demo


    //#define BOOST_SPIRIT_DEBUG
    #include <boost/fusion/adapted/std_pair.hpp>
    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <iostream> // std::cout
    #include <map>      // std::multimap
    #include <string>
    
    namespace qi = boost::spirit::qi;
    namespace px = boost::phoenix;
    
    int main() {
        using It = std::string::const_iterator;
        using namespace qi::labels;
    
        std::string rest_content;
        std::multimap<std::string, std::string> scopes;
    
        using SkipRule = qi::rule<It>;
        SkipRule comment_only 
            = "//" >> *~qi::char_("\r\n") >> qi::eol
            | "/*" >> *(qi::char_ - "*/") >> "*/"
            ;
    
        SkipRule comment_or_ws
            = qi::space | comment_only;
    
        qi::rule<It, std::string()> type
            = ( qi::string("struct") 
              | qi::string("class") 
              | qi::string("union") 
              | qi::string("enum") >> -(*comment_or_ws >> qi::string("class"))
              | qi::string("namespace") 
              )
            >> !qi::graph // must be followed by whitespace
            ;
    
        qi::rule<It, std::string()> identifier = 
            qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9")
            ;
    
        qi::rule<It, std::pair<std::string, std::string>()> scope
            = qi::skip(comment_or_ws.alias()) [
                type >> identifier
                >> *~qi::char_(";{") // ignore some stuff like base classes
                >> qi::omit["{" >> *~qi::char_("}") >> "}" | ';']
            ];
    
        qi::rule<It, SkipRule> block 
            = *(
                scope [px::insert(px::ref(scopes), _1)]
              | qi::skip(comment_only.alias()) [ 
                    qi::as_string[qi::raw[+(qi::char_ - scope)]] [px::ref(rest_content) += _1]
              ] // rest
            );
    
        //BOOST_SPIRIT_DEBUG_NODES((block)(scope)(identifier)(type))
    
        std::string const code = R"(
    // some random sample "code"
    struct base { 
        std::vector<int> ints;
    };
    /* class skipped_comment : base { };
     */
    
    namespace q { namespace nested { } } // nested is not supported
    
    class forward_declared;
    
    template <typename T> // actually basically ignored
    class
            Derived 
    : base {
                std::string more_data_members;
    };
    
    enum class MyEnum : int32_t {
        foo = 0,
        bar, /* whoop } */
        qux = foo + bar
    };
    
    int main() {
        return 0;
    }
                )";
    
        qi::phrase_parse(begin(code), end(code), block, comment_or_ws);
    
        for (auto& [k,v] : scopes) {
            std::cout << k << ": " << v << "\n";
        }
    
        std::cout << "------------------ BEGIN REST_CONTENT -----------------\n";
        std::cout << rest_content << "\n";
        std::cout << "------------------ END REST_CONENT --------------------\n";
    }
    

    Which parses the following sample input:

    // some random sample "code"
    struct base { 
        std::vector<int> ints;
    };
    /* class skipped_comment : base { };
     */
    
    namespace q { namespace nested { } } // nested is not supported
    
    class forward_declared;
    
    template <typename T> // actually basically ignored
    class
            Derived 
    : base {
                std::string more_data_members;
    };
    
    enum class MyEnum : int32_t {
        foo = 0,
        bar, /* whoop } */
        qux = foo + bar
    };
    
    int main() {
        return 0;
    }
    

    Printing

    class: forward_declared
    class: Derived
    enumclass: MyEnum
    namespace: q
    struct: base
    ------------------ BEGIN REST_CONTENT -----------------
    ;}template <typename T>;;
    
    int main() {
        return 0;
    }
    
    ------------------ END REST_CONENT --------------------
    

    Conclusion

    This result seems a decent way to

    • explain how to tackle the specific hurdle
    • demonstrate how this approach to parsing breaks down at the slightest obstacle (namespace a { namespace b { } } for example)

    Caveat Emptor