Search code examples
boostboost-spiritboost-phoenix

Boost Spirit tokenise expression into vector


I am trying to parse an expression which can also contain identifiers and push each element into an std::vector <std::string>, and I have come up with the following grammar:

#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix_core.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <vector>

namespace qi = boost::spirit::qi;

struct Tokeniser
    : boost::spirit::qi::grammar <std::string::const_iterator, std::vector <std::string> (), boost::spirit::ascii::space_type>
{
    Tokeniser() : Tokeniser::base_type(expression)
    {
        namespace qi = boost::spirit::qi;
        expression = 
            term >>
            *( (qi::string("+")[qi::_val.push_back(qi::_1)] >> term) |
               (qi::string("-")[qi::_val.push_back(qi::_1)] >> term) );

        term = 
            factor >>
            *( (qi::string("*")[qi::_val.push_back(qi::_1)] >> factor) |
               (qi::string("/")[qi::_val.push_back(qi::_1)] >> factor) );

        factor =
            (identifier | myDouble_)[qi::_val.push_back(qi::_1)] |
            qi::string("(")[qi::_val.push_back(qi::_1)] >> expression >> qi::string(")")[qi::_val.push_back(qi::_1)];

        identifier = qi::raw [ qi::lexeme[ (qi::alpha | '_') >> *(qi::alnum | '_') ] ]; 

        myDouble_ = qi::raw [ qi::double_ ];
    }

    boost::spirit::qi::rule<std::string::const_iterator, std::vector <std::string> (), boost::spirit::ascii::space_type> expression;
    boost::spirit::qi::rule<std::string::const_iterator, boost::spirit::ascii::space_type>                               factor;
    boost::spirit::qi::rule<std::string::const_iterator, boost::spirit::ascii::space_type>                               term;        

    boost::spirit::qi::rule<std::string::const_iterator, std::string(), boost::spirit::ascii::space_type> identifier;
    boost::spirit::qi::rule<std::string::const_iterator, std::string(), boost::spirit::ascii::space_type> myDouble_;
};

However, I get the following error 'const struct boost::phoenix::actor<boost::spirit::attribute<0> >' has no member named 'push_back'.

Is there a direct way to perform what I am trying to do?


Solution

  • Yeah, the placeholder types do (obviously) not have push_back member.

    C++ is strong typed. Any deferred action is an "illusion": actors are represented in expression templates by composing special-purpose types that can be "evaluated" later.

    Intro To Expression Templates

    Live On Coliru

    Just in case you want to wrap your head around how it actually works, a simple example from scratch. The comments describe what the various parts of the code do:

    // we have lazy placeholder types:
    template <int N> struct placeholder {};
    placeholder<1> _1;
    placeholder<2> _2;
    placeholder<3> _3;
    
    // note that every type here is stateless, and acts just like a more
    // complicated placeholder.
    // We can have expressions, like binary addition:
    template <typename L, typename R> struct addition { };
    template <typename L, typename R> struct multiplication { };
    
    // here is the "factory" for our expression template:
    template <typename L, typename R> addition<L,R> operator+(L const&, R const&) { return {}; }
    template <typename L, typename R> multiplication<L,R> operator*(L const&, R const&) { return {}; }
    
    ///////////////////////////////////////////////
    // To evaluate/interpret the expressions, we have to define "evaluation" for each type of placeholder:
    template <typename Ctx, int N> 
        auto eval(Ctx& ctx, placeholder<N>) { return ctx.arg(N); }
    template <typename Ctx, typename L, typename R>
        auto eval(Ctx& ctx, addition<L, R>) { return eval(ctx, L{}) + eval(ctx, R{}); }
    template <typename Ctx, typename L, typename R>
        auto eval(Ctx& ctx, multiplication<L, R>) { return eval(ctx, L{}) * eval(ctx, R{}); }
    
    ///////////////////////////////////////////////
    // A simple real-life context would contain the arguments:
    #include <vector>
    struct Context {
        std::vector<double> _args;
    
        // define the operation to get an argument from this context:
        double arg(int i) const { return _args.at(i-1); }
    };
    
    #include <iostream>
    int main() {
        auto foo = _1 + _2 + _3;
    
        Context ctx { { 3, 10, -4 } };
        std::cout << "foo: " << eval(ctx, foo) << "\n";
        std::cout << "_1 + _2 * _3: " << eval(ctx, _1 + _2 * _3) << "\n";
    }
    

    Output is exactly what you'd expect:

    foo: 9
    _1 + _2 * _3: -37
    

    Fixing The Action:

    You'll have to "describe" the push_back operation instead of trying to find such an operation on the placeholder. Phoenix has your back:

    #include <boost/phoenix/stl.hpp>
    

    Now I'd simplify the actions to use phoenix::push_back:

    auto push = px::push_back(qi::_val, qi::_1);
    
    expression = 
        term >>
        *( (qi::string("+")[push] >> term) |
           (qi::string("-")[push] >> term) );
    
    term = 
        factor >>
        *( (qi::string("*")[push] >> factor) |
           (qi::string("/")[push] >> factor) );
    
    factor =
        (identifier | myDouble_)[push] |
        qi::string("(")[push] >> expression >> qi::string(")")[push];
    
    // etc.
    

    However, this has the additional problem that _val gets resolved to the attribute type of the rule. But some of your rules do not declare an attribute type, so it defaults to qi::unused_type. Clearly, generating the "push_back" evaluation code for that attribute is not going to work for unused_type.

    Let's fix those declarations:

    qi::rule<It, std::vector<std::string>(), boost::spirit::ascii::space_type> expression;
    qi::rule<It, std::vector<std::string>(), boost::spirit::ascii::space_type> factor;
    qi::rule<It, std::vector<std::string>(), boost::spirit::ascii::space_type> term;
    

    Other Problems!

    When we do this, the tokens remain largely empty. What gives?

    In the presence of semantic actions, automatic attribute propagation is inhibited. So, you have to work to get the contents of sub expressions appended to the final token vector.

    Again, using Phoenix's STL support:

    auto push      = px::push_back(qi::_val, qi::_1);
    auto propagate = px::insert(qi::_val, px::end(qi::_val), px::begin(qi::_1), px::end(qi::_1));
    
    expression = 
        term[propagate] >>
        *( (qi::string("+")[push] >> term[propagate]) |
           (qi::string("-")[push] >> term[propagate]) );
    
    term = 
        factor[propagate] >>
        *( (qi::string("*")[push] >> factor[propagate]) |
           (qi::string("/")[push] >> factor[propagate]) );
    
    factor =
        (identifier | myDouble_)[push] |
        qi::string("(")[push] >> expression[propagate] >> qi::string(")")[push];
    

    Now, a test with Live On Coliru

    #define BOOST_SPIRIT_DEBUG
    #include <boost/spirit/include/qi.hpp>
    #include <boost/spirit/include/phoenix.hpp>
    #include <boost/phoenix/stl.hpp>
    #include <vector>
    
    namespace qi = boost::spirit::qi;
    namespace px = boost::phoenix;
    
    template <typename It = std::string::const_iterator>
    struct Tokeniser
        : qi::grammar <It, std::vector <std::string> (), boost::spirit::ascii::space_type>
    {
        Tokeniser() : Tokeniser::base_type(expression)
        {
            auto push      = px::push_back(qi::_val, qi::_1);
            auto propagate = px::insert(qi::_val, px::end(qi::_val), px::begin(qi::_1), px::end(qi::_1));
    
            expression = 
                term[propagate] >>
                *( (qi::string("+")[push] >> term[propagate]) |
                   (qi::string("-")[push] >> term[propagate]) );
    
            term = 
                factor[propagate] >>
                *( (qi::string("*")[push] >> factor[propagate]) |
                   (qi::string("/")[push] >> factor[propagate]) );
    
            factor =
                (identifier | myDouble_)[push] |
                qi::string("(")[push] >> expression[propagate] >> qi::string(")")[push];
    
            identifier = qi::raw [ qi::lexeme[ (qi::alpha | '_') >> *(qi::alnum | '_') ] ]; 
    
            myDouble_ = qi::raw [ qi::double_ ];
    
            BOOST_SPIRIT_DEBUG_NODES((expression)(term)(factor)(identifier)(myDouble_))
        }
    
        qi::rule<It, std::vector<std::string>(), boost::spirit::ascii::space_type> expression;
        qi::rule<It, std::vector<std::string>(), boost::spirit::ascii::space_type> factor;
        qi::rule<It, std::vector<std::string>(), boost::spirit::ascii::space_type> term;
        qi::rule<It, std::string(), boost::spirit::ascii::space_type> identifier;
        qi::rule<It, std::string(), boost::spirit::ascii::space_type> myDouble_;
    };
    
    int main() {
        Tokeniser<> tok;
    
        std::string const input = "x + 89/(y*y)";
    
        auto f = input.begin(), l = input.end();
        std::vector<std::string> tokens;
        if (phrase_parse(f, l, tok, boost::spirit::ascii::space, tokens)) {
            std::cout << "Parsed " << tokens.size() << " tokens:\n";
            for (auto& token : tokens)
                std::cout << " - '" << token << "'\n";
        } else {
            std::cout << "Parse failed\n";
        }
    
        if (f != l)
            std::cout << "Remaining unparsed input: '" << std::string(f,l) << "'\n";
    }
    

    Prints

    Parsed 9 tokens:
     - 'x'
     - '+'
     - '89'
     - '/'
     - '('
     - 'y'
     - '*'
     - 'y'
     - ')'
    

    The Big Cleanup

    In general, avoid semantic actions (see my answer Boost Spirit: "Semantic actions are evil"? - notably the bullet about side-effects). Most of the time, you can get away with automatic attribute propagation. I'd say this is the key selling point of Boost Spirit.

    Further simplifying things around skipping/lexemes (Boost spirit skipper issues) does reduce code and compilation times significantly:

    Live On Coliru

    #define BOOST_SPIRIT_DEBUG
    #include <boost/spirit/include/qi.hpp>
    #include <vector>
    
    namespace qi = boost::spirit::qi;
    
    template <typename It = std::string::const_iterator>
    struct Tokeniser : qi::grammar <It, std::vector <std::string>()> {
        Tokeniser() : Tokeniser::base_type(start)
        {
            start = qi::skip(boost::spirit::ascii::space) [expression];
            expression =
                term >>
                *( (qi::string("+") >> term) |
                   (qi::string("-") >> term) );
    
            term =
                factor >>
                *( (qi::string("*") >> factor) |
                   (qi::string("/") >> factor) );
    
            factor =
                (identifier | myDouble_) |
                qi::string("(") >> expression >> qi::string(")");
    
            identifier = qi::raw [ (qi::alpha | '_') >> *(qi::alnum | '_') ]; 
            myDouble_  = qi::raw [ qi::double_ ];
    
            BOOST_SPIRIT_DEBUG_NODES((expression)(term)(factor)(identifier)(myDouble_))
        }
    
        qi::rule<It, std::vector<std::string>()> start;
        qi::rule<It, std::vector<std::string>(), boost::spirit::ascii::space_type> expression, factor, term;
        qi::rule<It, std::string()> identifier, myDouble_;
    };
    
    int main() {
        Tokeniser<> tok;
    
        std::string const input = "x + 89/(y*y)";
    
        auto f = input.begin(), l = input.end();
        std::vector<std::string> tokens;
        if (parse(f, l, tok, tokens)) {
            std::cout << "Parsed " << tokens.size() << " tokens:\n";
            for (auto& token : tokens)
                std::cout << " - '" << token << "'\n";
        } else {
            std::cout << "Parse failed\n";
        }
    
        if (f != l)
            std::cout << "Remaining unparsed input: '" << std::string(f,l) << "'\n";
    }
    

    Still prints

    Parsed 9 tokens:
     - 'x'
     - '+'
     - '89'
     - '/'
     - '('
     - 'y'
     - '*'
     - 'y'
     - ')'
    

    Remaining problems

    Have you thought about backtracking behaviour? I think you need some judiciously placed qi::hold[] directives inside your rules, see e.g. Understanding Boost.spirit's string parser

    BIG QUESTIONS:

    1. Why are you "parsing" when all you want is tokens?
    2. What will you do with the tokens? The parser builds a lot of information, only to discard it again