Parsing white-spaces in between lexemes using boost-spirit

I want to parse a BNF grammar using boost::spirit. The parser works fine. However, I also want to be able to read the white-space that occurs in between lexemes. For example, suppose I have a grammar like this:

<name> ::= <firstname> <surname>
<firstname> ::= <char><char> | <firstname><char>
<surname> ::= <char><char> | <surname><char>
<char>   ::= a | b | c ... | z

Suppose I have a rewriting system that uses the above grammar. For <name> it should end up producing something like David Harvey. However, if the <name> rule were written as <name> ::= <firstname><surname>, the rewriting system should produce DavidHarvey instead, because the rewriting system is white-space sensitive.

Solution

Generation is a fundamentally different job than parsing.

Parsing removes redundancy and normalizes data. Generation adds redundancy and chooses (one of typically many) representations according to some goals (stylistic guides, efficiency goals etc).

By allowing yourself to get side-tracked by the similarity with BNF, you've lost sight of your goals: in BNF, many instances of whitespace are simply not significant.

This is manifest in the direct observation that the AST does not contain the whitespace.
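To make that concrete, here is a minimal sketch using only the standard library (not Spirit, and not the parser from the question). The function name `parse_terms_skipping` is hypothetical; it splits one alternative into terms while skipping blanks, roughly the way a blank skipper would. Both spellings of the `<name>` rule end up as the same term list, so nothing downstream can tell them apart.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper: split the right-hand side of one alternative into
// terms, discarding blanks between them (as a skipper would).
std::vector<std::string> parse_terms_skipping(const std::string& rhs) {
    std::vector<std::string> terms;
    std::size_t i = 0;
    while (i < rhs.size()) {
        if (rhs[i] == ' ' || rhs[i] == '\t') { // skipped, never stored
            ++i;
            continue;
        }
        const std::size_t start = i;
        if (rhs[i] == '<') {
            // rule name: everything up to and including the closing '>'
            const std::size_t close = rhs.find('>', i);
            i = (close == std::string::npos) ? rhs.size() : close + 1;
        } else {
            // bare word: up to the next blank or '<'
            while (i < rhs.size() && rhs[i] != ' ' && rhs[i] != '\t' && rhs[i] != '<')
                ++i;
        }
        assert(i > start); // every branch consumes at least one character
        terms.push_back(rhs.substr(start, i - start));
    }
    return terms;
}
```

Calling it on `"<firstname> <surname>"` and on `"<firstname><surname>"` yields the same two terms, which is exactly the information loss the question runs into.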

Hacking It

The simplest way would be to represent the skipped whitespace instead as "string literals" inside your AST:

    _term       = _literal | _rule_name | _whitespace;

With

    _whitespace = +blank;

And then making the _list rule a lexeme as well (so as to not skip blanks):

    // lexemes
    qi::rule<Iterator, Ast::List()>   _list;
    qi::rule<Iterator, std::string()> _literal, _whitespace;

See it Live On Compiler Explorer
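The idea behind the hack, independent of Spirit, can be sketched with the standard library alone. Everything here (`Term`, `parse_terms_keeping_blanks`, `render`) is a hypothetical stand-in for illustration, not the AST or rules from the actual parser: blanks become terms of their own, so a generator can emit them verbatim.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical AST node: whitespace is kept as its own kind of term.
struct Term {
    enum Kind { RuleName, Literal, Whitespace } kind;
    std::string text;
};

static bool is_blank(char c) { return c == ' ' || c == '\t'; }

// Split one alternative into terms WITHOUT skipping blanks.
std::vector<Term> parse_terms_keeping_blanks(const std::string& rhs) {
    std::vector<Term> terms;
    std::size_t i = 0;
    while (i < rhs.size()) {
        const std::size_t start = i;
        Term::Kind kind;
        if (is_blank(rhs[i])) {
            while (i < rhs.size() && is_blank(rhs[i])) ++i;
            kind = Term::Whitespace;
        } else if (rhs[i] == '<') {
            const std::size_t close = rhs.find('>', i);
            i = (close == std::string::npos) ? rhs.size() : close + 1;
            kind = Term::RuleName;
        } else {
            while (i < rhs.size() && !is_blank(rhs[i]) && rhs[i] != '<') ++i;
            kind = Term::Literal;
        }
        assert(i > start); // every branch consumes at least one character
        terms.push_back({kind, rhs.substr(start, i - start)});
    }
    return terms;
}

// Emit the terms, replacing rule names by an already chosen expansion.
std::string render(const std::vector<Term>& terms,
                   const std::map<std::string, std::string>& chosen) {
    std::string out;
    for (const Term& t : terms) {
        if (t.kind == Term::RuleName) {
            const auto it = chosen.find(t.text);
            out += (it != chosen.end()) ? it->second : t.text;
        } else {
            out += t.text; // literals and whitespace are copied verbatim
        }
    }
    return out;
}
```

With `<firstname>` mapped to `David` and `<surname>` to `Harvey`, rendering `"<firstname> <surname>"` keeps the blank between the two names, while `"<firstname><surname>"` does not.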

Clean Solution

The above leaves a few "warts": there are spots where whitespace is still not significant (namely around | and specifically before the list-attribute numbers):

<code>   ::=  <letter><digit> 34 | <letter><digit><code> 23
<letter> ::= "a" 1 | "b" 2 | "c" 3 | "d" 4 | "e" 5 | "f" 6 | "g" 7 | "h" 8 | "i" 9
<digit>  ::= "9" 10 | "1" 11 | "2" 12 | "3" 13 | "4" 14

I don't see how it would usefully be significant there, unless of course your input doesn't look like the input you've been using. E.g. if it looks like this instead:

<code>::=<letter><digit>34|<letter><digit><code>23
<letter>::="a"1|"b"2|"c"3|"d"4|"e"5|"f"6|"g"7|"h"8|"i"9
<digit>::="9"10|"1"11|"2"12|"3"13|"4"14

You could make all the rules lexemes. However, that doesn't square with the presence of quoted strings at all. The whole notion of quoted strings is to mark regions where normal whitespace (and comment) skipping is suspended.
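As a rough illustration of that role of quoted strings, here is a hand-written scanner sketch (standard library only; `scan_terms` is a made-up name, and escape sequences inside quotes are deliberately not handled). Blanks are skipped between terms but preserved verbatim between a pair of double quotes.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical scanner: blanks are skipped between terms, but kept
// untouched inside double quotes.
std::vector<std::string> scan_terms(const std::string& input) {
    std::vector<std::string> terms;
    std::size_t i = 0;
    while (i < input.size()) {
        const char c = input[i];
        if (c == ' ' || c == '\t') { // skipping is active out here
            ++i;
            continue;
        }
        const std::size_t start = i;
        if (c == '"') {
            // skipping is suspended until the closing quote
            std::size_t close = input.find('"', i + 1);
            if (close == std::string::npos) close = input.size(); // unterminated: take the rest
            terms.push_back(input.substr(i + 1, close - i - 1));
            i = (close < input.size()) ? close + 1 : close;
        } else {
            while (i < input.size() && input[i] != ' ' && input[i] != '\t' && input[i] != '"')
                ++i;
            terms.push_back(input.substr(start, i - start));
        }
        assert(i > start); // every branch consumes at least one character
    }
    return terms;
}
```

If every rule were a lexeme, the quotes would no longer buy you anything: whitespace would be significant everywhere, not just inside them.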

I have a nagging feeling that you are much farther away from your actual problem (see https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem) than we can even currently see, and you might even have stripped the whole quoted-string-literals concept from the "BNF" already.

A clean solution would be to forget about misleading similarities with BNF and just devise your own grammar from the ground up.

If the goal is simply to have a (recursive) macro/template expansion engine, it should really turn out a lot simpler than what you currently have. Maybe you can describe your real task (input, desired output and required behaviors) so we can help you achieve that?
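To give an idea of how small such an engine can be, here is a sketch under heavy assumptions: one template per name, no alternatives, no quoting, and a fixed depth budget to stop runaway recursion. The names `Templates` and `expand` are invented for this example and use only the standard library. Text outside `<...>` is copied verbatim, so whitespace is significant by construction.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical template table: "<name>" -> replacement text.
using Templates = std::map<std::string, std::string>;

// Recursively replace every known <name> in `text`; everything else,
// blanks included, is copied through unchanged.
std::string expand(const Templates& templates, const std::string& text,
                   int depth_left = 16) {
    assert(depth_left >= 0 && "depth budget must not be negative");
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t open = text.find('<', i);
        if (open == std::string::npos) {
            out.append(text, i, std::string::npos); // trailing literal text
            break;
        }
        out.append(text, i, open - i); // literal text before the name
        const std::size_t close = text.find('>', open);
        if (close == std::string::npos) {
            out.append(text, open, std::string::npos); // stray '<': keep as-is
            break;
        }
        const std::string name = text.substr(open, close - open + 1);
        const auto it = templates.find(name);
        if (it != templates.end() && depth_left > 0)
            out += expand(templates, it->second, depth_left - 1);
        else
            out += name; // unknown name or budget exhausted: leave untouched
        i = close + 1;
    }
    return out;
}
```

With `<name>` defined as `<firstname> <surname>`, expanding `<name>` keeps the blank between the two parts; defining it as `<firstname><surname>` drops it. Alternatives, attributes and quoting would have to be layered on top, depending on what the real task requires.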