Search code examples
c++boost-spiritboost-spirit-x3

Make boost::spirit::symbol parser non greedy


I'd like to make a keyword parser that matches i.e. int, but does not match int in integer with eger left over. I use x3::symbols to get automatically get the parsed keyword represented as an enum value.

Minimal example:

#include <boost/spirit/home/x3.hpp>
#include <boost/spirit/home/x3/support/utility/error_reporting.hpp>

namespace x3 = boost::spirit::x3;

enum class TypeKeyword { Int, Float, Bool };

struct TypeKeywordSymbolTable : x3::symbols<TypeKeyword> {
    TypeKeywordSymbolTable()
    {
        add("float", TypeKeyword::Float)
           ("int",   TypeKeyword::Int)
           ("bool",  TypeKeyword::Bool);
    }
};
const TypeKeywordSymbolTable type_keyword_symbol_table;

struct TypeKeywordRID {};
using TypeKeywordRule = x3::rule<TypeKeywordRID, TypeKeyword>;

const TypeKeywordRule type_keyword = "type_keyword";
const auto type_keyword_def        = type_keyword_symbol_table;
BOOST_SPIRIT_DEFINE(type_keyword);

using Iterator = std::string_view::const_iterator;

/* Thrown when the parser has failed to parse the whole input stream. Contains
 * the part of the input stream that has not been parsed. */
class LeftoverError : public std::runtime_error {
  public:
    LeftoverError(Iterator begin, Iterator end)
            : std::runtime_error(std::string(begin, end))
    {}

    std::string_view get_leftover_data() const noexcept { return what(); }
};

template<typename Rule>
typename Rule::attribute_type parse(std::string_view input, const Rule& rule)
{
    Iterator begin = input.begin();
    Iterator end   = input.end();

    using ExpectationFailure = boost::spirit::x3::expectation_failure<Iterator>;
    typename Rule::attribute_type result;
    try {
        bool r = x3::phrase_parse(begin, end, rule, x3::space, result);
        if (r && begin == end) {
            return result;
        } else { // Occurs when the whole input stream has not been consumed.
            throw LeftoverError(begin, end);
        }
    } catch (const ExpectationFailure& exc) {
        throw LeftoverError(exc.where(), end);
    }
}

int main()
{
    // TypeKeyword::Bool is parsed and "ean" is leftover, but failed parse with
    // "boolean" leftover is desired.
    parse("boolean", type_keyword);

    // TypeKeyword::Int is parsed and "eger" is leftover, but failed parse with
    // "integer" leftover is desired.
    parse("integer", type_keyword);

    // TypeKeyword::Int is parsed successfully and this is the desired behavior.
    parse("int", type_keyword);
}

Basicly, I want integer not to be recognized as a keyword with additional eger left to parse.


Solution

  • I morphed the test cases into self-describing expectations:

    Live On Compiler Explorer

    Prints:

    FAIL  "boolean"    -> TypeKeyword::Bool    (expected Leftover:"boolean")
    FAIL  "integer"    -> TypeKeyword::Int     (expected Leftover:"integer")
    OK    "int"        -> TypeKeyword::Int    
    

    Now, the simplest, naive approach would be to make sure you parse till the eoi, by simply changing

    auto actual = parse(input, Parser::type_keyword);
    

    To

    auto actual = parse(input, Parser::type_keyword >> x3::eoi);
    

    And indeed the tests pass: Live

    OK    "boolean"    -> Leftover:"boolean"  
    OK    "integer"    -> Leftover:"integer"  
    OK    "int"        -> TypeKeyword::Int    
    

    However, this fits the tests, but not the goal. Let's imagine a more involved grammar, where type identifier; is to be parsed:

    auto identifier 
        = x3::rule<struct id_, Ast::Identifier> {"identifier"}
        = x3::lexeme[x3::char_("a-zA-Z_") >> *x3::char_("a-zA-Z_0-9")];
    
    auto type_keyword
        = x3::rule<struct tk_, Ast::TypeKeyword> {"type_keyword"}
        = type_;
    
    auto declaration 
        = x3::rule<struct decl_, Ast::Declaration>{"declaration"}
        = type_keyword >> identifier >> ';';
    

    I'll leave the details for Compiler Explorer:

    OK    "boolean b;" -> Leftover:"boolean b;"
    OK    "integer i;" -> Leftover:"integer i;"
    OK    "int j;"     -> (TypeKeyword::Int j)
    

    Looks good. But what if we add some interesting tests:

        {"flo at f;", LeftoverError("flo at f;")},
        {"boolean;", LeftoverError("boolean;")},
    

    It prints (Live)

    OK    "boolean b;" -> Leftover:"boolean b;"
    OK    "integer i;" -> Leftover:"integer i;"
    OK    "int j;"     -> (TypeKeyword::Int j)
    FAIL  "boolean;"   -> (TypeKeyword::Bool ean) (expected Leftover:"boolean;")
    

    So, the test cases were lacking. Your prose description is actually closer:

    I'd like to make a keyword parser that matches i.e. int, but does not match int in integer with eger left over

    That correctly implies you want to check the lexeme inside the type_keyword rule. A naive try might be checking that no identifier character follows the type keyword:

    auto type_keyword
        = x3::rule<struct tk_, Ast::TypeKeyword> {"type_keyword"}
        = type_ >> !identchar;
    

    Where identchar was factored out of identifier like so:

    auto identchar = x3::char_("a-zA-Z_0-9");
    
    auto identifier 
        = x3::rule<struct id_, Ast::Identifier> {"identifier"}
        = x3::lexeme[x3::char_("a-zA-Z_") >> *identchar];
    

    However, this doesn't work. Can you see why (peeking allowed: https://godbolt.org/z/jb4zfhfWb)?

    Our latest devious test case now passes (yay), but int j; is now rejected! If you think about it, it only makes sense, because you have spaced skipped.

    The essential word I used a moment ago was lexeme: you want to treat some units as lexemes (and whitespace stops the lexeme. Or rather, whitespace isn't automatically skipped inside the lexeme¹). So, a fix would be:

    auto type_keyword
        // ...
        = x3::lexeme[type_ >> !identchar];
    

    (Note how I sneakily already included that on the identifier rule earlier)

    Lo and behold (Live):

    OK    "boolean b;" -> Leftover:"boolean b;"
    OK    "integer i;" -> Leftover:"integer i;"
    OK    "int j;"     -> (TypeKeyword::Int j)
    OK    "boolean;"   -> Leftover:"boolean;" 
    

    Summarizing

    This topic is a frequently recurring one, and it requires a solid understanding of skippers, lexemes first and foremost. Here are some other posts for inspiration:

    Good luck!

    Complete Listing

    Anti-Bitrot, the final listing:

    #include <boost/fusion/adapted.hpp>
    #include <boost/fusion/include/as_vector.hpp>
    #include <boost/fusion/include/io.hpp>
    #include <boost/lexical_cast.hpp>
    #include <boost/spirit/home/x3.hpp>
    #include <iomanip>
    #include <iostream>
    
    namespace x3 = boost::spirit::x3;
    
    namespace Ast {
        enum class TypeKeyword { Int, Float, Bool };
    
        static std::ostream& operator<<(std::ostream& os, TypeKeyword tk) {
            switch (tk) {
                case TypeKeyword::Int:   return os << "TypeKeyword::Int";
                case TypeKeyword::Float: return os << "TypeKeyword::Float";
                case TypeKeyword::Bool:  return os << "TypeKeyword::Bool";
            };
            return os << "?";
        }
    
        using Identifier = std::string;
    
        struct Declaration {
            TypeKeyword type;
            Identifier id;
    
            bool operator==(Declaration const&) const = default;
        };
    
    } // namespace Ast
    
    BOOST_FUSION_ADAPT_STRUCT(Ast::Declaration, type, id)
    
    namespace Ast{
        static std::ostream& operator<<(std::ostream& os, Ast::Declaration const& d) {
            return os << boost::lexical_cast<std::string>(boost::fusion::as_vector(d));
        }
    } // namespace Ast
    
    namespace Parser {
        struct Type : x3::symbols<Ast::TypeKeyword> {
            Type() {
                add("float", Ast::TypeKeyword::Float) //
                    ("int", Ast::TypeKeyword::Int)    //
                    ("bool", Ast::TypeKeyword::Bool); //
            }
        } const static type_;
    
        auto identchar = x3::char_("a-zA-Z_0-9");
    
        auto identifier 
            = x3::rule<struct id_, Ast::Identifier> {"identifier"}
            = x3::lexeme[x3::char_("a-zA-Z_") >> *identchar];
    
        auto type_keyword
            = x3::rule<struct tk_, Ast::TypeKeyword> {"type_keyword"}
            = x3::lexeme[type_ >> !identchar];
    
        auto declaration 
            = x3::rule<struct decl_, Ast::Declaration>{"declaration"}
            = type_keyword >> identifier >> ';';
    } // namespace Parser
    
    struct LeftoverError : std::runtime_error {
        using std::runtime_error::runtime_error;
    
        friend std::ostream& operator<<(std::ostream& os, LeftoverError const& e) {
            return os << (std::string("Leftover:\"") + e.what() + "\"");
        }
        bool operator==(LeftoverError const& other) const {
            return std::string_view(what()) == other.what();
        }
    };
    
    template<typename T>
    using Maybe = boost::variant<T, LeftoverError>;
    
    template <typename Rule,
              typename Attr = typename x3::traits::attribute_of<Rule, x3::unused_type>::type,
              typename R = Maybe<Attr>>
    R parse(std::string_view input, Rule const& rule) {
        Attr result;
        auto f = input.begin(), l = input.end();
        return x3::phrase_parse(f, l, rule, x3::space, result)
            ? R(std::move(result))
            : LeftoverError({f, l});
    }
    
    int main()
    {
        using namespace Ast;
    
        struct {
            std::string_view        input;
            Maybe<Declaration>      expected;
        } cases[] = {
            {"boolean b;", LeftoverError("boolean b;")},
            {"integer i;", LeftoverError("integer i;")},
            {"int j;", Declaration{TypeKeyword::Int, "j"}},
            {"boolean;", LeftoverError("boolean;")},
        };
        for (auto [input, expected] : cases) {
            auto actual = parse(input, Parser::declaration >> x3::eoi);
            bool ok     = expected == actual;
    
            std::cout << std::left << std::setw(6) << (ok ? "OK" : "FAIL")
                      << std::setw(12) << std::quoted(input) << " -> "
                      << std::setw(20) << actual;
            if (not ok)
                std::cout << " (expected " << expected << ")";
            std::cout << "\n";
        }
    }
    

    ¹ see Boost spirit skipper issues