c++ boost boost-spirit boost-spirit-qi boost-spirit-lex

Why does qi::skip fail with tokens from the lexer?


I'm using boost::spirit lex and qi to parse some source code.

I already skip whitespace in the input string using the lexer. What I would like to do is turn comment skipping on or off in the parser, depending on the context.

Here is a basic demo. See the comments in Grammar::Grammar() for my problem:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix.hpp>

#include <cstring>   // strlen
#include <iostream>
#include <string>

namespace lex = boost::spirit::lex;
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;

typedef lex::lexertl::token<char const*, boost::mpl::vector<std::string>, boost::mpl::false_ > token_type;
typedef lex::lexertl::actor_lexer<token_type> lexer_type;

struct TokenId
{
   enum type
   {
      INVALID_TOKEN_ID = lex::min_token_id,
      COMMENT
   };
};

struct Lexer : lex::lexer<lexer_type>
{
public:
   lex::token_def<std::string> comment;
   lex::token_def<std::string> identifier;
   lex::token_def<std::string> lineFeed;
   lex::token_def<std::string> space;

   Lexer()
   {
      comment = "\\/\\*.*?\\*\\/|\\/\\/[^\\r\\n]*";
      identifier = "[A-Za-z_][A-Za-z0-9_]*";
      space = "[\\x20\\t\\f\\v]+";
      lineFeed = "(\\r\\n)|\\r|\\n";

      this->self = space[lex::_pass = lex::pass_flags::pass_ignore];
      this->self += lineFeed[lex::_pass = lex::pass_flags::pass_ignore];
      this->self.add
         (comment, TokenId::COMMENT)
         (identifier)
         (';')
         ;
   }
};

typedef Lexer::iterator_type Iterator;

void traceComment(const std::string& content)
{
   std::cout << "  comment: " << content << std::endl;
}

class Grammar : public qi::grammar<Iterator>
{
   typedef token_type skipped_t;

   qi::rule<Iterator, qi::unused_type, qi::unused_type> m_start;
   qi::rule<Iterator, qi::unused_type, qi::unused_type, skipped_t> m_variable;
   qi::rule<Iterator, std::string(), qi::unused_type> m_comment;

public:
   Lexer lx;

public:
   Grammar() :
      Grammar::base_type(m_start)
   {
// This does not work (comments are not skipped in m_variable)
      m_start = *(
            m_comment[phx::bind(&traceComment, qi::_1)]
         |  qi::skip(qi::token(TokenId::COMMENT))[m_variable]
         );

      m_variable = lx.identifier >> lx.identifier >> ';';
      m_comment = qi::token(TokenId::COMMENT);
/** But this works:
      m_start = *(
         m_comment[phx::bind(&traceComment, qi::_1)]
         | m_variable
         );

      m_variable = qi::skip(qi::token(TokenId::COMMENT))[lx.identifier >> lx.identifier >> ';'];
      m_comment = qi::token(TokenId::COMMENT);
*/
   }
};

void test(const char* code)
{
   std::cout << code << std::endl;

   Grammar parser;
   const char* begin = code;
   const char* end = code + strlen(code);
   tokenize_and_parse(begin, end, parser.lx, parser);

   if (begin == end)
      std::cout << "-- OK --" << std::endl;
   else
      std::cout << "-- FAILED --" << std::endl;
   std::cout << std::endl;
}

int main(int argc, char* argv[])
{
   test("/* kept */ int foo;");
   test("int /* ignored */ foo;");
   test("int foo /* ignored */;");
   test("int foo; // kept");
}

The output is:

/* kept */ int foo;
  comment: /* kept */
-- OK --

int /* ignored */ foo;
-- FAILED --

int foo /* ignored */;
-- FAILED --

int foo; // kept
  comment: // kept
-- OK --

Is there any issue with skipped_t?


Solution

  • The behavior you are describing is what I would expect from my experience.

    When you write

    my_rule = qi::skip(ws) [ foo >> lit(',') >> bar >> lit('=') >> baz ];
    

    this is essentially the same as writing

    my_rule = *ws >> foo >> *ws >> lit(',') >> *ws >> bar >> *ws >> lit('=') >> *ws >> baz;
    

    (This assumes that ws is a rule with no attribute. If it has an attribute in your grammar, that attribute is ignored, as if wrapped in qi::omit.)

    Notably, the skipper does not get propagated inside the foo rule unless foo is itself declared with a matching skipper type. So foo, bar, and baz can still be whitespace-sensitive in the above. What the skip directive does is make the grammar ignore leading whitespace in this rule, and whitespace around the ',' and '=' in this rule.

    More info here: http://boost-spirit.com/home/2010/02/24/parsing-skippers-and-skipping-parsers/


    Edit:

    Also, I don't think skipped_t is doing what you think it is.

    When you use a custom skipper, the rule's template parameter has to name the type of a parser, and the skip directive (or phrase_parse) is given an instance of that parser. For example, qi::blank_type is the type of the parser object qi::blank, so a rule declared with qi::blank_type as its skipper can be driven by qi::skip(qi::blank).

    I don't see any evidence that you've set that up: you've just typedef'ed skipped_t to token_type, which is a token type and not a parser. If you want m_variable to accept a skipper (if that's even possible with lexer tokens, I don't know), skipped_t would have to be the type of the parser you actually want to skip with, presumably the one that matches TokenId::COMMENT. (If you skipped all tokens of all types, you couldn't possibly match anything, so I'm not sure what your intention was in making token_type the skipper.)

    My guess is that when qi sees token_type in the rule's parameter list, it either ignores it or interprets it as part of the rule's return value or something similar. I'm not sure exactly what it does.