How to keep space characters when breaking input down into a sequence of different parts using the alternative parser?

I want to write a simple C++ parser which extracts the block hierarchy. I am using this rule:

std::string rest_content;
std::vector<boost::tuple<std::string, std::string>> scopes;
qi::rule<It, qi::ascii::space_type> block = *(
            r.comment
        |   r.scope[push_back(boost::phoenix::ref(scopes), _1)]
        |   qi::char_[boost::phoenix::ref(rest_content) += _1] // rest
    );

qi::phrase_parse(first, last,
        block,
        ascii::space);

which is supposed to break down code into three parts: comment, scope (code surrounded by "{}") and "rest". The problem is that all the space characters are removed from the "rest". I need those spaces for later parsing (such as extracting identifiers).

I have tried using qi::skip, qi::lexeme and qi::raw to keep spaces:

// one of many failed attempts
qi::rule<It, qi::ascii::space_type> block = qi::lexeme[*(
            qi::skip[r.comment]
        |   qi::skip[r.scope[push_back(boost::phoenix::ref(scopes), _1)]]
        |   qi::char_[push_back(boost::phoenix::ref(rest_content), _1)]
    )];

but it never works.

So how to keep space characters? Any help is welcome. Thanks.

Solution

If you're parsing C++ code this way you may be biting off more than you can chew.

I'll answer, but the answer should show you how limited this approach is going to be. Just imagine parsing through

namespace q::x {
    namespace y {
        struct A {
            template <typename = ns1::c<int>, typename...> struct C;
        };

        template <typename T, typename... Ts>
        struct A::C final : ns2::ns3::base<A::C<T, Ts...>, Ts...> {
             int simple = [](...) {
                  enum class X : unsigned { answer = 42, };
                  struct {
                      auto operator()(...) -> decltype(auto) {
                           return static_cast<int>(X::answer);
                      } 
                  } iife;
                  return iife();
             }("/* }}} */"); // {{{
        };
    }
}

and getting it right. And yes, that's valid code.

In fact it's so tricky, that it's easy to make "grown compilers" (GCC) trip: https://wandbox.org/permlink/FzcaSl6tbn18jq4f (Clang has no issue: https://wandbox.org/permlink/wu0mFwQiTOogKB5L).

That said, let me refer to my old explanation of how rule declarations and skippers work together: "Boost spirit skipper issues".

And show an approximation of what I'd do.

The comments

Actually, the comments should be part of your skipper, so let's make it so:

using SkipRule = qi::rule<It>;
SkipRule comment_only 
    = "//" >> *~qi::char_("\r\n") >> qi::eol
    | "/*" >> *(qi::char_ - "*/") >> "*/"
    ;

Now for general skipping, we want to include whitespace:

SkipRule comment_or_ws
    = qi::space | comment_only;

Now we want to parse types and identifiers:

qi::rule<It, std::string()> type
    = ( qi::string("struct") 
      | qi::string("class") 
      | qi::string("union") 
      | qi::string("enum") >> -(*comment_or_ws >> qi::string("class"))
      | qi::string("namespace") 
      )
    >> !qi::graph // must be followed by whitespace
    ;

qi::rule<It, std::string()> identifier = 
    qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9")
    ;

I've /guessed/ that struct X { }; would be an example of a "scope" for you, and the tuple would contain ("struct", "X").

As a bonus, I used attribute adaptation of std::pair, and later on I show how to insert into a multimap for good measure.

qi::rule<It, std::pair<std::string, std::string>()> scope
    = qi::skip(comment_or_ws.alias()) [
        type >> identifier
        >> *~qi::char_(";{") // ignore some stuff like base classes
        >> qi::omit["{" >> *~qi::char_("}") >> "}" | ';']
    ];

Note that a big shortcoming here is that the first non-commented '}' will "end" the scope. That's not how the language works (see the leading example).
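For illustration, here is what "matching" a brace would minimally involve. This is a sketch in plain C++ rather than Spirit, and matching_brace is a hypothetical helper, not part of the grammar above. It counts nesting depth instead of stopping at the first '}'. It deliberately assumes that braces never occur inside comments, string literals or character literals, so the leading example would still defeat it:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (illustration only): given the index of an opening '{',
// return the index of the '}' that closes it, or std::string::npos if there
// is none. ASSUMPTION: no braces inside comments or string/char literals.
std::size_t matching_brace(std::string const& text, std::size_t open) {
    if (open >= text.size() || text[open] != '{')
        return std::string::npos; // not positioned on an opening brace

    std::size_t depth = 0;
    for (std::size_t i = open; i < text.size(); ++i) {
        if (text[i] == '{') {
            ++depth;                       // entering a nested scope
        } else if (text[i] == '}') {
            if (--depth == 0) return i;    // closed the scope we started with
        }
    }
    return std::string::npos;              // unbalanced input
}
```

For namespace q { namespace nested { } } this pairs the first '{' with the last '}', whereas the scope rule above stops at the first '}' it sees.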

Now we can conclude with an improved "block" rule:

qi::rule<It, SkipRule> block 
    = *(
        scope [px::insert(px::ref(scopes), _1)]
      | qi::skip(comment_only.alias()) [ 
            qi::as_string[qi::raw[+(qi::char_ - scope)]] [px::ref(rest_content) += _1]
      ] // rest
    );

Note that

we override the comment_or_ws skipper with comment_only so we don't drop all whitespace from "rest content"
inversely, we override the skipper to include whitespace inside the scope rule, because otherwise the negative scope invocation (char_ - scope) would do the wrong thing: it wouldn't skip whitespace

Full Demo

Live On Coliru

//#define BOOST_SPIRIT_DEBUG
#include <boost/fusion/adapted/std_pair.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <iostream>
#include <map>
#include <string>

namespace qi = boost::spirit::qi;
namespace px = boost::phoenix;

int main() {
    using It = std::string::const_iterator;
    using namespace qi::labels;

    std::string rest_content;
    std::multimap<std::string, std::string> scopes;

    using SkipRule = qi::rule<It>;
    SkipRule comment_only 
        = "//" >> *~qi::char_("\r\n") >> qi::eol
        | "/*" >> *(qi::char_ - "*/") >> "*/"
        ;

    SkipRule comment_or_ws
        = qi::space | comment_only;

    qi::rule<It, std::string()> type
        = ( qi::string("struct") 
          | qi::string("class") 
          | qi::string("union") 
          | qi::string("enum") >> -(*comment_or_ws >> qi::string("class"))
          | qi::string("namespace") 
          )
        >> !qi::graph // must be followed by whitespace
        ;

    qi::rule<It, std::string()> identifier = 
        qi::char_("a-zA-Z_") >> *qi::char_("a-zA-Z_0-9")
        ;

    qi::rule<It, std::pair<std::string, std::string>()> scope
        = qi::skip(comment_or_ws.alias()) [
            type >> identifier
            >> *~qi::char_(";{") // ignore some stuff like base classes
            >> qi::omit["{" >> *~qi::char_("}") >> "}" | ';']
        ];

    qi::rule<It, SkipRule> block 
        = *(
            scope [px::insert(px::ref(scopes), _1)]
          | qi::skip(comment_only.alias()) [ 
                qi::as_string[qi::raw[+(qi::char_ - scope)]] [px::ref(rest_content) += _1]
          ] // rest
        );

    //BOOST_SPIRIT_DEBUG_NODES((block)(scope)(identifier)(type))

    std::string const code = R"(
// some random sample "code"
struct base { 
    std::vector<int> ints;
};
/* class skipped_comment : base { };
 */

namespace q { namespace nested { } } // nested is not supported

class forward_declared;

template <typename T> // actually basically ignored
class
        Derived 
: base {
            std::string more_data_members;
};

enum class MyEnum : int32_t {
    foo = 0,
    bar, /* whoop } */
    qux = foo + bar
};

int main() {
    return 0;
}
            )";

    qi::phrase_parse(begin(code), end(code), block, comment_or_ws);

    for (auto& [k,v] : scopes) {
        std::cout << k << ": " << v << "\n";
    }

    std::cout << "------------------ BEGIN REST_CONTENT -----------------\n";
    std::cout << rest_content << "\n";
    std::cout << "------------------ END REST_CONENT --------------------\n";
}

Which parses the following sample input:

// some random sample "code"
struct base { 
    std::vector<int> ints;
};
/* class skipped_comment : base { };
 */

namespace q { namespace nested { } } // nested is not supported

class forward_declared;

template <typename T> // actually basically ignored
class
        Derived 
: base {
            std::string more_data_members;
};

enum class MyEnum : int32_t {
    foo = 0,
    bar, /* whoop } */
    qux = foo + bar
};

int main() {
    return 0;
}

Printing

class: forward_declared
class: Derived
enumclass: MyEnum
namespace: q
struct: base
------------------ BEGIN REST_CONTENT -----------------
;}template <typename T>;;

int main() {
    return 0;
}

------------------ END REST_CONENT --------------------

Conclusion

This result should be enough to

explain how to tackle the specific hurdle, and
demonstrate how this approach to parsing breaks down at the slightest obstacle (namespace a { namespace b { } } for example).

Caveat Emptor