Tags: c++, parsing, boost, boost-spirit

Boost Spirit Qi parser does not consume whole string expression?


Assuming I have the following rule:

identifier %= 
        lexeme[
            char_("a-zA-Z")
            >> -(*char_("a-zA-Z_0-9")
            >> char_("a-zA-Z0-9"))
        ]
        ;

qi::rule<Iterator, std::string(), Skipper> identifier;

and the following input:

// identifier
This_is_a_valid123_Identifier

As the trace shows, the identifier rule succeeds and its attribute holds the whole identifier, but only the first character is consumed: the skipper resumes at the second character of the string:

<identifier>
  <try>This_is_a_valid123_I</try>
  <skip>
    <try>This_is_a_valid123_I</try>
    <emptylines>
      <try>This_is_a_valid123_I</try>
      <fail/>
    </emptylines>
    <comment>
      <try>This_is_a_valid123_I</try>
      <fail/>
    </comment>
    <fail/>
  </skip>
  <success>his_is_a_valid123_Id</success>
  <attributes>[[T, h, i, s, _, i, s, _, a, _, v, a, l, i, d, 1, 2, 3, _, I, d, e, n, t, i, f, i, e, r]]</attributes>
</identifier>
<skip>
  <try>his_is_a_valid123_Id</try>
  <emptylines>
    <try>his_is_a_valid123_Id</try>
    <fail/>
  </emptylines>
  <comment>
    <try>his_is_a_valid123_Id</try>
    <fail/>
  </comment>
  <fail/>
</skip>

I've already tried using as_string inside the lexeme expression, which did not help.


Solution

  • I don't see why you complicate the expression. Can you try

    identifier %= 
                    char_("a-zA-Z")
                >> *char_("a-zA-Z_0-9")
            ;
    
    qi::rule<Iterator, std::string()> identifier;
    

    This is about the most standard expression you can get. Even if you don't want to allow identifiers ending in _, I'm quite sure you don't want such a trailing _ to be parsed as 'the next token'. In that case, I'd just add validation after the parse.

    Update, in response to the comment:

    Here is the analysis:

    • First off: -(*x) is a red flag. It is never a useful pattern: *x already matches an empty sequence, so you can't make it "more optional".

      (In fact, if *x allowed partial backtracking as in regular expressions, you'd likely have seen exponential performance or even infinite runtime; "luckily", *x is always greedy in Spirit Qi.)

    This indeed facilitates your bug. Let's look at the parser expression in the OP, treating its three char_ lines as lines 1, 2 and 3.

    • First, Line 1 matches T.
    • The second line initially greedily matches his_is_a_valid123_Identifier.
    • But that cannot satisfy the third line, so the -(...) kicks in and everything after line 1 is backtracked.
    • However: Qi

      • does backtrack the cursor (current input iterator) but
      • does not by default rollback changes to container attributes.

      Yes. You guessed it. std::string is a container attribute.

    So what you end up with is a successful match of length 1, plus the residue of a failed optional sequence in the attribute.
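    The sequence of events above can be modelled with a small hand-written sketch. This is a toy model in plain C++, not Spirit's actual implementation; `ToyResult`, `toy_identifier` and the `rollback_attribute` flag are hypothetical names introduced only to illustrate the difference between rewinding the cursor and rewinding the attribute.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Toy model only: mimics the observable behaviour described above.
struct ToyResult {
    std::size_t consumed;   // where the cursor ends up
    std::string attribute;  // what ended up in the container attribute
};

inline bool is_alpha(char c) { return std::isalpha(static_cast<unsigned char>(c)) != 0; }
inline bool is_alnum(char c) { return std::isalnum(static_cast<unsigned char>(c)) != 0; }

// Models: char_("a-zA-Z") >> -(*char_("a-zA-Z_0-9") >> char_("a-zA-Z0-9"))
// rollback_attribute == false models the default (no attribute rollback).
ToyResult toy_identifier(const std::string& in, bool rollback_attribute) {
    ToyResult r{0, ""};
    if (in.empty() || !is_alpha(in[0])) return r;  // line 1 fails: no match

    r.attribute.push_back(in[0]);                  // line 1 matches one letter
    std::size_t pos = 1;

    const std::size_t saved_pos = pos;             // state before the optional part
    const std::string saved_attr = r.attribute;

    // Line 2: greedy repetition, appending to the attribute as it goes.
    while (pos < in.size() && (is_alnum(in[pos]) || in[pos] == '_'))
        r.attribute.push_back(in[pos++]);

    // Line 3: needs one more alphanumeric character, but line 2 ate them all.
    if (pos < in.size() && is_alnum(in[pos])) {
        r.attribute.push_back(in[pos++]);
    } else {
        pos = saved_pos;                           // the cursor is always rewound
        if (rollback_attribute) r.attribute = saved_attr;  // the attribute only on request
    }

    r.consumed = pos;
    return r;
}
```

    With `rollback_attribute` set to `false`, the model reports one consumed character while the attribute still holds the full identifier, which matches the trace in the question. With it set to `true`, the attribute is reduced to `T`, which makes the real problem visible: the optional part of the expression can never succeed.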

    Some other backgrounders on how to resolve this kind of backtracking issue: