I am using Boost::Spirit to build simple "data-filter" language in my C++ GUI application for non-technical users. Language is very similiar to plain English and parse-able into AST. I am requested to make the process as user-friendly as possible, so I wish to provide CLang-like error messages ("Unrecognized 'tokken', did you meant 'token'?") and autocompletion.
The question in place is how to use Boost::Spirit grammar to generate possible token list for both goals (I will use simple string distance algorithm to satisfy first requirement)?
My ideas so far:
The problem with this solution is the suggestion will also suggest tokens that are invalid for given place. And if I add (and I will) definable identifiers, I have much bigger problem in hand...
One more constraint: I want to have grammar for this language defined in only one place; If the grammar changes, I want to autocompleter be aware of this changes after recompilation
Out of curiosity, I decided to try the concept.
Here's my attempt.
Let's parse arithmetic expressions with function calls.
Now, we want to parse (partial) expression with possible unknown identifiers.
In case of incomplete expressions, we want to "imply" the minimal token to complete it (and potentially continue parsing).
In case of unknown identifiers, we want to fuzzy-match the known identifiers in the domain (either variables or functions) and rank them in order of decreasing probability.
Let's start out by deciding we want our input to be in memory, so we can refer to locations/substrings by using string_view
s:
#include <boost/utility/string_view.hpp>
using Source = boost::string_view;
using Location = Source::const_iterator;
Besides the AST, we want our parser to generate completion hints (the implicit completion tokens and suggestions)
namespace Completion {
using Candidates = std::vector<std::string>;
class Hints {
struct ByLocation {
template <typename T, typename U>
bool operator()(T const& a, U const& b) const { return loc(a) < loc(b); }
private:
static Location loc(Source const& s) { return s.begin(); }
static Location loc(Location const& l) { return l; }
};
public:
std::map<Location, std::string, ByLocation> incomplete;
std::map<Source, Candidates, ByLocation> suggestions;
/*explicit*/ operator bool() const { return incomplete.size() || suggestions.size(); }
};
In addition, let's code up a quick & dirty fuzzy identifier match scoring function.
I opted for a simple synchronizing compare that
adj_diff
is a good fuzzy match for adjacent_difference
even though characters have been skipped from the candidate, but adj_qqq_diff
is worse because the qqq
from the input could not be matched)rate=1
at first invocation) static int fuzzy_match(Source input, boost::string_view candidate, int rate = 1) { // start with first-letter boost
int score = 0;
while (!(input.empty() || candidate.empty())) {
if (input.front() != candidate.front()) {
return score + std::max(
fuzzy_match(input.substr(1), candidate, std::max(rate-2,0)), //penalty for ignoring an input char
fuzzy_match(input, candidate.substr(1), std::max(rate-1,0)));
}
input.remove_prefix(1);
candidate.remove_prefix(1);
score += ++rate;
}
return score;
}
} // namespace Completion
We'll see how this is used in the grammar.
A run-of-the-mill AST that can do binary expressions, string/numeric literals, variables and (nested) function calls:
#include <boost/variant.hpp>
namespace Ast {
using NumLiteral = double;
using StringLiteral = std::string; // de-escaped, not source view
struct Identifier : std::string {
using std::string::string;
using std::string::operator=;
};
struct BinaryExpression;
struct CallExpression;
using Expression = boost::variant<
NumLiteral,
StringLiteral,
Identifier,
boost::recursive_wrapper<BinaryExpression>,
boost::recursive_wrapper<CallExpression>
>;
struct BinaryExpression {
Expression lhs;
char op;
Expression rhs;
};
using ArgList = std::vector<Expression>;
struct CallExpression {
Identifier function;
ArgList args;
};
}
The grammar starts off pretty "basic" as well:
namespace Parsing {
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
template <typename It>
struct Expression : qi::grammar<It, Ast::Expression()> {
Expression(Completion::Hints& hints) : Expression::base_type(start), _hints(hints) {
using namespace qi;
start = skip(space) [expression];
expression = term [_val = _1] >> *(char_("-+") >> expression) [_val = make_binary(_val, _1, _2)];
term = factor [_val = _1] >> *(char_("*/") >> term) [_val = make_binary(_val, _1, _2)];
factor = simple [_val = _1] >> *(char_("^") >> factor) [_val = make_binary(_val, _1, _2)];
simple = call | variable | compound | number | string;
So far so good: The constructor stores a reference to the Completion::Hints&
to be recorded. All these rules have been declared as qi::rule<It, Expression(), qi::space_type>
.
Now it gets slightly more interesting, some rules here are lexemes¹
number = double_;
identifier = raw[(alpha|'_') >> *(alnum|'_')];
And some productions are tolerant of missing closing tokens:
// imply the closing quotes
string %= '"' >> *('\\' >> char_ | ~char_('"')) >> implied('"');
compound %= '(' >> expression >> implied(')');
The implied
magic makes it so that if the expected closing token is missing, it will be recorded as a hint, and parsing continues:
auto implied = [=](char ch) {
return copy(omit[lit(ch) | raw[eps][imply(_1, ch)]]);
};
Of course, this begs the question what imply(_1, ch)
really does, and we'll see later.
For now, observe that we do
raw[eps]
to get an (empty) sourceiterator_range
to construct aLocation
from in the semantic action.
We store "symbol tables" in Domain
members _variables
and _functions
:
using Domain = qi::symbols<char>;
Domain _variables, _functions;
Then we declare some rules that can do lookups on either of them:
// domain identifier lookups
qi::_r1_type _domain;
qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;
The corresponding declarations will be shown shortly.
Variables are pretty simple:
variable = maybe_known(phx::ref(_variables));
Calls are trickier. If a name is unknown we don't want to assume it implies a function unless it's followed by a '('
character. However, if an identifier is a known function name, we want even to imply the (
(this gives the UX the appearance of autocompletion where when the user types sqrt
, it suggests the next character to be (
magically).
// The heuristics:
// - an unknown identifier followed by (
// - an unclosed argument list implies )
call %= ( known(phx::ref(_functions)) // known -> imply the parens
| &(identifier >> '(') >> unknown(phx::ref(_functions))
) >> implied('(') >> -(expression % ',') >> implied(')');
It all builds on known
, unknown
and maybe_known
:
///////////////////////////////
// identifier loopkup, suggesting
{
maybe_known = known(_domain) | unknown(_domain);
// distinct to avoid partially-matching identifiers
using boost::spirit::repository::qi::distinct;
auto kw = distinct(copy(alnum | '_'));
known = raw[kw[lazy(_domain)]];
unknown = raw[identifier[_val=_1]] [suggest_for(_1, _domain)];
}
Remaining is a bit of red tape:
BOOST_SPIRIT_DEBUG_NODES(
(start)
(expression)(term)(factor)(simple)(compound)
(call)(variable)
(identifier)(number)(string)
//(maybe_known)(known)(unknown) // qi::symbols<> non-streamable
)
_variables += "foo", "bar", "qux";
_functions += "print", "sin", "tan", "sqrt", "frobnicate";
}
private:
Completion::Hints& _hints;
using Domain = qi::symbols<char>;
Domain _variables, _functions;
qi::rule<It, Ast::Expression()> start;
qi::rule<It, Ast::Expression(), qi::space_type> expression, term, factor, simple;
// completables
qi::rule<It, Ast::Expression(), qi::space_type> compound;
qi::rule<It, Ast::CallExpression(), qi::space_type> call;
// implicit lexemes
qi::rule<It, Ast::Identifier()> variable, identifier;
qi::rule<It, Ast::NumLiteral()> number;
qi::rule<It, Ast::StringLiteral()> string;
// domain identifier lookups
qi::_r1_type _domain;
qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;
Let's start with the usual helper to construct binary AST nodes:
///////////////////////////////
// binary expression factory
struct make_binary_f {
Ast::BinaryExpression operator()(Ast::Expression const& lhs, char op, Ast::Expression const& rhs) const {
return {lhs, op, rhs};
}
};
boost::phoenix::function<make_binary_f> make_binary;
Similarly, we can have a Phoenix function object to register implied chars:
///////////////////////////////
// auto-completing incomplete expressions
struct imply_f {
Completion::Hints& _hints;
void operator()(boost::iterator_range<It> where, char implied_char) const {
auto inserted =
_hints.incomplete.emplace(&*where.begin(), std::string(1, implied_char));
// add the implied char to existing completion
if (!inserted.second)
inserted.first->second += implied_char;
}
};
boost::phoenix::function<imply_f> imply { imply_f { _hints } };
Note that it orders by location (which makes UX easier) and detects when a previous implied token lived at the same location, in which case we simply append to it.
By now it won't be much of a surprise that generating relevant suggestions for unknown identifiers is also a phoenix actor:
///////////////////////////////
// suggest_for
struct suggester {
Completion::Hints& _hints;
void operator()(boost::iterator_range<It> where, Domain const& symbols) const {
using namespace Completion;
Source id(&*where.begin(), where.size());
Candidates c;
symbols.for_each([&](std::string const& k, ...) { c.push_back(k); });
auto score = [id](Source v) { return fuzzy_match(id, v); };
auto byscore = [=](Source a, Source b) { return score(a) > score(b); };
sort(c.begin(), c.end(), byscore);
c.erase( remove_if(c.begin(), c.end(), [=](Source s) { return score(s) < 3; }), c.end());
_hints.suggestions.emplace(id, c);
}
};
boost::phoenix::function<suggester> suggest_for {suggester{_hints}};
That's all. If it looks more complicated, that's because it does a lot more: it scores all candidates fuzzily, orders them by descending score and removes any candidates with score < 3.
};
}
Let's alter things a little more and allow ','
to be implied inside argument lists:
call %= ( known(phx::ref(_functions)) // known -> imply the parens
| &(identifier >> '(') >> unknown(phx::ref(_functions))
)
>> implied('(')
>> -(expression % ',')
>> implied(')')
;
Let's replace ','
there:
>> -(expression % (',' | !(')'|eoi) >> implied(',')))
NOTE: It might seem to make more sense to detect
&expression
to assert that an expression follows, instead asserting that the end of the input/argument list has not been reached.Doing that goes bad, though, because any contained expressions could trigger more implied tokens as a side effect, leading to duplicated implied tokens.
This (side-effectful semantic actions) is one of the chief reasons why I usually avoid semantic actions, or use them to store effect only in the rule's (exposed) attribute(s). See Boost Spirit: "Semantic actions are evil"?
This post would be nothing without some nice test cases. So here goes:
int main() {
for (Source const input : {
"", // invalid
"(3", // incomplete, imply ')'
"3*(6+sqrt(9))^7 - 1e8", // completely valid
"(3*(((6+sqrt(9))^7 - 1e8", // incomplete, imply ")))"
"print(\"hello \\\"world!", // completes the string literal and the parameter list
"foo", // okay, known variable
"baz", // (suggest bar)
"baz(", // incomplete, imply ')', unknown function
"taz(", // incomplete, imply ')', unknown function
"san(", // 2 suggestions (sin/tan)
"print(1, 2, \"three\", complicated(san(78",
"(print sqrt sin 9) -0) \"bye",
})
{
std::cout << "-------------- '" << input << "'\n";
Location f = input.begin(), l = input.end();
Ast::Expression expr;
Completion::Hints hints;
bool ok = parse(f, l, Parsing::Expression<Location>{hints}, expr);
if (hints) {
std::cout << "Input: '" << input << "'\n";
}
for (auto& c : hints.incomplete) {
std::cout << "Missing " << std::setw(c.first - input.begin()) << "" << "^ implied: '" << c.second << "'\n";
}
for (auto& id : hints.suggestions) {
std::cout << "Unknown " << std::setw(id.first.begin() - input.begin()) << "" << std::string(id.first.size(), '^');
if (id.second.empty())
std::cout << " (no suggestions)\n";
else {
std::cout << " (did you mean ";
once_t first;
for (auto& s : id.second)
std::cout << (first?"":" or ") << "'" << s << "'";
std::cout << "?)\n";
}
}
if (ok) {
std::cout << "AST: " << expr << "\n";
} else {
std::cout << "Parse failed\n";
}
if (f!=l)
std::cout << "Remaining input: '" << std::string(f,l) << "'\n";
}
}
Note that, besides the first input (""
) everything is heuristically parsed as some kind of expression! The last one is designed to show off how all the parameter lists are implied because print
, sqrt
and sin
are known functions. Then some ','
are implied, before finally the unclosed string literal and remaining parentheses are closed. Here's the (non-debug) output:
-------------- ''
Parse failed
-------------- '(3'
Input: '(3'
Missing ^ implied: ')'
AST: 3
-------------- '3*(6+sqrt(9))^7 - 1e8'
AST: ((3 * ((6 + (sqrt (9))) ^ 7)) - 1e+08)
-------------- '(3*(((6+sqrt(9))^7 - 1e8'
Input: '(3*(((6+sqrt(9))^7 - 1e8'
Missing ^ implied: ')))'
AST: (3 * (((6 + (sqrt (9))) ^ 7) - 1e+08))
-------------- 'print("hello \"world!'
Input: 'print("hello \"world!'
Missing ^ implied: '")'
AST: (print (hello "world!))
-------------- 'foo'
AST: foo
-------------- 'baz'
Input: 'baz'
Unknown ^^^ (did you mean 'bar'?)
AST: baz
-------------- 'baz('
Input: 'baz('
Missing ^ implied: ')'
Unknown ^^^ (no suggestions)
AST: (baz ())
-------------- 'taz('
Input: 'taz('
Missing ^ implied: ')'
Unknown ^^^ (did you mean 'tan'?)
AST: (taz ())
-------------- 'san('
Input: 'san('
Missing ^ implied: ')'
Unknown ^^^ (did you mean 'sin' or 'tan'?)
AST: (san ())
-------------- 'print(1, 2, "three", complicated(san(78'
Input: 'print(1, 2, "three", complicated(san(78'
Missing ^ implied: ')))'
Unknown ^^^^^^^^^^^ (did you mean 'frobnicate' or 'print'?)
Unknown ^^^ (did you mean 'sin' or 'tan'?)
AST: (print (1, 2, three, (complicated ((san (78))))))
-------------- '(print sqrt sin 9) -0) "bye'
Input: '(print sqrt sin 9) -0) "bye'
Missing ^ implied: '('
Missing ^ implied: '('
Missing ^ implied: '('
Missing ^ implied: ','
Missing ^ implied: '"))'
AST: (print ((sqrt (((sin (9)) - 0))), bye))
//#define BOOST_SPIRIT_DEBUG
#include <boost/utility/string_view.hpp>
using Source = boost::string_view;
using Location = Source::const_iterator;
#include <map>
#include <vector>
namespace Completion {
static int fuzzy_match(Source input, boost::string_view candidate, int rate = 1) { // start with first-letter boost
int score = 0;
while (!(input.empty() || candidate.empty())) {
if (input.front() != candidate.front()) {
return score + std::max(
fuzzy_match(input.substr(1), candidate, std::max(rate-2,0)), //penalty for ignoring an input char
fuzzy_match(input, candidate.substr(1), std::max(rate-1,0)));
}
input.remove_prefix(1);
candidate.remove_prefix(1);
score += ++rate;
}
return score;
}
using Candidates = std::vector<std::string>;
class Hints {
struct ByLocation {
template <typename T, typename U>
bool operator()(T const& a, U const& b) const { return loc(a) < loc(b); }
private:
static Location loc(Source const& s) { return s.begin(); }
static Location loc(Location const& l) { return l; }
};
public:
std::map<Location, std::string, ByLocation> incomplete;
std::map<Source, Candidates, ByLocation> suggestions;
/*explicit*/ operator bool() const { return incomplete.size() || suggestions.size(); }
};
}
#include <boost/variant.hpp>
namespace Ast {
using NumLiteral = double;
using StringLiteral = std::string; // de-escaped, not source view
struct Identifier : std::string {
using std::string::string;
using std::string::operator=;
};
struct BinaryExpression;
struct CallExpression;
using Expression = boost::variant<
NumLiteral,
StringLiteral,
Identifier,
boost::recursive_wrapper<BinaryExpression>,
boost::recursive_wrapper<CallExpression>
>;
struct BinaryExpression {
Expression lhs;
char op;
Expression rhs;
};
using ArgList = std::vector<Expression>;
struct CallExpression {
Identifier function;
ArgList args;
};
}
#include <boost/fusion/adapted/struct.hpp>
BOOST_FUSION_ADAPT_STRUCT(Ast::BinaryExpression, lhs, op, rhs)
BOOST_FUSION_ADAPT_STRUCT(Ast::CallExpression, function, args)
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/repository/include/qi_distinct.hpp>
// for debug printing:
namespace {
struct once_t { // an auto-reset flag
operator bool() { bool v = flag; flag = false; return v; }
bool flag = true;
};
}
// for debug printing:
namespace Ast {
static inline std::ostream& operator<<(std::ostream& os, std::vector<Expression> const& args) {
os << "(";
once_t first;
for (auto& a : args) os << (first?"":", ") << a;
return os << ")";
}
static inline std::ostream& operator<<(std::ostream& os, BinaryExpression const& e) {
return os << boost::fusion::as_vector(e);
}
static inline std::ostream& operator<<(std::ostream& os, CallExpression const& e) {
return os << boost::fusion::as_vector(e);
}
}
namespace Parsing {
namespace qi = boost::spirit::qi;
namespace phx = boost::phoenix;
template <typename It>
struct Expression : qi::grammar<It, Ast::Expression()> {
Expression(Completion::Hints& hints) : Expression::base_type(start), _hints(hints) {
using namespace qi;
start = skip(space) [expression];
expression = term [_val = _1] >> *(char_("-+") >> expression) [_val = make_binary(_val, _1, _2)];
term = factor [_val = _1] >> *(char_("*/") >> term) [_val = make_binary(_val, _1, _2)];
factor = simple [_val = _1] >> *(char_("^") >> factor) [_val = make_binary(_val, _1, _2)];
simple = call | variable | compound | number | string;
auto implied = [=](char ch) {
return copy(omit[lit(ch) | raw[eps][imply(_1, ch)]]);
};
variable = maybe_known(phx::ref(_variables));
compound %= '(' >> expression >> implied(')');
// The heuristics:
// - an unknown identifier followed by (
// - an unclosed argument list implies )
call %= ( known(phx::ref(_functions)) // known -> imply the parens
| &(identifier >> '(') >> unknown(phx::ref(_functions))
)
>> implied('(')
>> -(expression % (',' | !(')'|eoi) >> implied(',')))
>> implied(')')
;
// lexemes, primitive rules
identifier = raw[(alpha|'_') >> *(alnum|'_')];
// imply the closing quotes
string %= '"' >> *('\\' >> char_ | ~char_('"')) >> implied('"'); // TODO more escapes
number = double_; // TODO integral arguments
///////////////////////////////
// identifier loopkup, suggesting
{
maybe_known = known(_domain) | unknown(_domain);
// distinct to avoid partially-matching identifiers
using boost::spirit::repository::qi::distinct;
auto kw = distinct(copy(alnum | '_'));
known = raw[kw[lazy(_domain)]];
unknown = raw[identifier[_val=_1]] [suggest_for(_1, _domain)];
}
BOOST_SPIRIT_DEBUG_NODES(
(start)
(expression)(term)(factor)(simple)(compound)
(call)(variable)
(identifier)(number)(string)
//(maybe_known)(known)(unknown) // qi::symbols<> non-streamable
)
_variables += "foo", "bar", "qux";
_functions += "print", "sin", "tan", "sqrt", "frobnicate";
}
private:
Completion::Hints& _hints;
using Domain = qi::symbols<char>;
Domain _variables, _functions;
qi::rule<It, Ast::Expression()> start;
qi::rule<It, Ast::Expression(), qi::space_type> expression, term, factor, simple;
// completables
qi::rule<It, Ast::Expression(), qi::space_type> compound;
qi::rule<It, Ast::CallExpression(), qi::space_type> call;
// implicit lexemes
qi::rule<It, Ast::Identifier()> variable, identifier;
qi::rule<It, Ast::NumLiteral()> number;
qi::rule<It, Ast::StringLiteral()> string;
// domain identifier lookups
qi::_r1_type _domain;
qi::rule<It, Ast::Identifier(Domain const&)> maybe_known, known, unknown;
///////////////////////////////
// binary expression factory
struct make_binary_f {
Ast::BinaryExpression operator()(Ast::Expression const& lhs, char op, Ast::Expression const& rhs) const {
return {lhs, op, rhs};
}
};
boost::phoenix::function<make_binary_f> make_binary;
///////////////////////////////
// auto-completing incomplete expressions
struct imply_f {
Completion::Hints& _hints;
void operator()(boost::iterator_range<It> where, char implied_char) const {
auto inserted =
_hints.incomplete.emplace(&*where.begin(), std::string(1, implied_char));
// add the implied char to existing completion
if (!inserted.second)
inserted.first->second += implied_char;
}
};
boost::phoenix::function<imply_f> imply { imply_f { _hints } };
///////////////////////////////
// suggest_for
struct suggester {
Completion::Hints& _hints;
void operator()(boost::iterator_range<It> where, Domain const& symbols) const {
using namespace Completion;
Source id(&*where.begin(), where.size());
Candidates c;
symbols.for_each([&](std::string const& k, ...) { c.push_back(k); });
auto score = [id](Source v) { return fuzzy_match(id, v); };
auto byscore = [=](Source a, Source b) { return score(a) > score(b); };
sort(c.begin(), c.end(), byscore);
c.erase( remove_if(c.begin(), c.end(), [=](Source s) { return score(s) < 3; }), c.end());
_hints.suggestions.emplace(id, c);
}
};
boost::phoenix::function<suggester> suggest_for {suggester{_hints}};
};
}
#include <iostream>
#include <iomanip>
int main() {
for (Source const input : {
"", // invalid
"(3", // incomplete, imply ')'
"3*(6+sqrt(9))^7 - 1e8", // completely valid
"(3*(((6+sqrt(9))^7 - 1e8", // incomplete, imply ")))"
"print(\"hello \\\"world!", // completes the string literal and the parameter list
"foo", // okay, known variable
"baz", // (suggest bar)
"baz(", // incomplete, imply ')', unknown function
"taz(", // incomplete, imply ')', unknown function
"san(", // 2 suggestions (sin/tan)
"print(1, 2, \"three\", complicated(san(78",
"(print sqrt sin 9) -0) \"bye",
})
{
std::cout << "-------------- '" << input << "'\n";
Location f = input.begin(), l = input.end();
Ast::Expression expr;
Completion::Hints hints;
bool ok = parse(f, l, Parsing::Expression<Location>{hints}, expr);
if (hints) {
std::cout << "Input: '" << input << "'\n";
}
for (auto& c : hints.incomplete) {
std::cout << "Missing " << std::setw(c.first - input.begin()) << "" << "^ implied: '" << c.second << "'\n";
}
for (auto& id : hints.suggestions) {
std::cout << "Unknown " << std::setw(id.first.begin() - input.begin()) << "" << std::string(id.first.size(), '^');
if (id.second.empty())
std::cout << " (no suggestions)\n";
else {
std::cout << " (did you mean ";
once_t first;
for (auto& s : id.second)
std::cout << (first?"":" or ") << "'" << s << "'";
std::cout << "?)\n";
}
}
if (ok) {
std::cout << "AST: " << expr << "\n";
} else {
std::cout << "Parse failed\n";
}
if (f!=l)
std::cout << "Remaining input: '" << std::string(f,l) << "'\n";
}
}