Search code examples
c++boostboost-spiritboost-spirit-qi

Boost Spirit 2 - Symbol expansion and backtracking


I am having some issues specifying and parsing a rather simple grammar.

vertex = char+
edge = vertex " -> " vertex
start = ((vertex | edge) eol)*

input = "a\nb\na -> b\n"

Spirit is doing the following:

"a" -> vertex
"\n" -> eol
-> start

"b" -> vertex
"\n" -> eol
-> start

and

"a" -> vertex
terminate

instead of identifying the edge in the end and parsing the entire input. That is, it could parse the entire input but it's not. Shouldn't it backtrack and attempt to parse using the alternate rule? And thus complete the start rule.

Is it because Spirit uses PEGs? (http://www.boost.org/doc/libs/1_57_0/libs/spirit/doc/html/spirit/abstracts/parsing_expression_grammar.html#spirit.abstracts.parsing_expression_grammar.alternatives)

Minimal working example:

#include <cstdio>
#include <string>
#include <boost/spirit/include/qi.hpp>

namespace qi = boost::spirit::qi;

void print(const std::string & str) {
  printf("%s\n", str.c_str());
}

void print_eol() {
  printf("eol\n");
}

int main() {
  std::string str = "a\nb\na -> b\n";
  std::string::iterator it = str.begin(), begin = str.begin(), end = str.end();

  qi::rule<std::string::iterator, std::string()> vertex = +qi::alnum;
  qi::rule<std::string::iterator, std::string()> edge = vertex >> " -> " >> vertex;
  qi::rule<std::string::iterator> start = (vertex[&print] | edge[&print]) % qi::eol[&print_eol];

  bool matched = parse(it, end, start);

  if (matched) {
    printf("matched\n");
  }
  if (it == end) {
    printf("all\n");
  } else if (it != begin) {
    printf("some\n");
  } else {
    printf("none\n");
  }

  return 0;
}

Output:

$ ./a.out
a
eol
b
eol
a
matched
some

I'm using Boost 1.57.0 and Clang 3.5.1 on MSYS2.


Solution

  • It seems that you answered your own question: yes, it's because PEG grammars are inherently greedy, left-to-right matching.

    Notes

    • printing from a semantic action is not a fair test, because even when the parser does backtrack (it can continue the next alternative branch on failure of a single expression) the side effect will already have fired. This is a major draw back of using Semantic Actions - unless you're careful (Boost Spirit: "Semantic actions are evil"?)

    • BOOST_SPIRIT_DEBUG and BOOST_SPIRIT_DEBUG_NODES macros are there fopr this purpose and give a lot more information

    • The obvious solutions here would be

      1. reorder the branches:

        start = (edge[&print] | vertex[&print]) % qi::eol[&print_eol];
        

        Live On Coliru

        a
        eol
        b
        eol
        a -> b
        matched all
        
      2. left - factorisation:

        start = (vertex[print] >> -(" -> " >> vertex)[print]) % qi::eol[&print_eol];
        

        Live On Coliru

        a
        eol
        b
        eol
        a
        b
        matched all
        

    If you want some real-life ideas beyond satisfying your curiosity about PEG grammars, you can search SO answers for ideas how to parse into separate collections of vertices/edges in roughly a dozen answers on SO, as I remember.