Search code examples
cflex-lexer

Flex: REJECT rejects one character at a time?


I'm parsing C++-style scoped names, e.g., A::B. I want to parse such a name as a single token for a type name if it was previously declared as a type; otherwise, I want to parse it as three tokens A, ::, and B.

Given:

L             [A-Za-z_]
D             [0-9]
S             [ \f\r\t\v]

identifier    {L}({L}|{D})*
sname         {identifier}({S}*::{S}*{identifier})+

%%

{sname}       {
                if ( find_type( yytext ) )
                  return Y_TYPE_NAME;
                REJECT;
              }

{identifier}  {
                // ...
                return Y_NAME;
              }

However, for the input sequence:

A::BB -> Not a type; returned as "A", "::", "BB"
(Declare A::B as a type.)
A::BB

What happens on the second parse of A::BB, REJECT is called and flex discards only 1 character from the end and tries to match A::B (one B). This matches the previously declared A::B in the {sname} rule which is wrong.

What I assumed REJECT did was to proceed to the second-best matching rule with the same input. Hence, I expected it to match A alone in {identifier} and just leave ::BB on the input stream. But, as shown, that's not what happens. It peels off one character at a time from the end of the input and re-attempts a match.

Adding in yyless() to chop off the ::BB doesn't help:

{sname}       {
                if ( find_type( yytext ) )
                  return Y_TYPE_NAME;
                char const *const sep = strpbrk( yytext, ": \t" );
                size_t keep_len = (size_t)(sep - yytext);
                yyless( keep_len );
                REJECT;
              }

The only thing that I've found that works is:

{sname}       {
                if ( find_type( yytext ) )
                  return Y_TYPE_NAME;
                char const *const sep = strpbrk( yytext, ": \t" );
                size_t keep_len = (size_t)(sep - yytext);
                yyless( keep_len );
                goto identifier;
              }

{identifier}  {
    identifier:
                // ...
                return Y_NAME;
              }

But these seems kind of hack-ish. Is there a more canonical "flex way" to do what I want?

Despite the hack-ish-ness, is there actually anything wrong with my solution? I.e., would it not work in some cases? Or not work in some versions of Flex?


Yes, I'm aware that, even if it did work the way I want, this won't parse all contrived C++ scoped names; but it's good enough.

Yes, I'm aware that generally parsing such things should be done as separate tokens, but C-like languages are hard to parse since they're not context-free. Knowing when you have a type really helps.

Yes, I'm aware that REJECT slows down the lexer. This is (mostly) for an interactive and command-line tool in a terminal, so the human is the slowest component.

I'd like to focus on the problem at hand with the code as it mostly is. Mine is more of a question about how to use REJECT, yyless(), etc., to get the desired behavior. Thanks.


Solution

  • REJECT does not make any distinction between different rules; it just falls back to the next possible accepting pattern (which might not even be shorter, if there's a lower-precedence rule which matches the same token.) That might be a shorter match of the same pattern. (Normally, Flex chooses the longest match out of the possible matches of the regular expression. With REJECT, the shorter matches are also considered.)

    So you can avoid the false match of A::B for input A::BB by using trailing context: [Note 1]

    {sname}/[^[:alnum:]_]  {...}
    

    (In Flex, you can't put trailing context in a macro because the expansion is surrounded with parentheses.)

    You could use that solution if you wanted to try all possible complete prefixes of id1::id2::id3::id4 starting with the longest one (id1::id2::id3::id4) and falling back to each shorter prefix in turn (id1::id2::id3, id1::id2, id1). The trailing context prevents intermediate matches in the middle of an identifier.

    I'm not sure if that is the problem you are trying to solve, though. And even if it is, it doesn't really solve the problem of what to do after you fall back to an acceptable solution, because you probably don't want to redo the search for prefixes in the middle of the sequence. In other words, if the original were A::B::A::B, and A::B is a "known prefix", the tokens returned should (presumably) be A::B, ::, A, ::, B rather than A::B, ::, A::B.

    On the other hand, you might not care about previously defined prefixes; only whether the complete sequence is a known type. In that case, the possible token sequences for A::B::C::D are [TYPENAME(A::B::C::D)] and [ID(A),, ::, ID(B), ::, ID(C), ID(D)]. For this case, you don't want to restrict the {sname} fallbacks; you want to eliminate them. After the fallback, you only want to try matches for different rules.

    There are a variety of alternative solutions (which do not use REJECT), mostly relying on the possibility of using start conditions. (You still need to think about what happens after the fallback. Start conditions can be useful for that, as well.)

    One fairly general solution is to define a start condition which excludes the rule you want to fall back from. Once you've determined that you want to fall back to a different rule, you change to a start condition which excludes the rule which just matched and then call yyless(0) (without returning). That will cause Flex to rescan the current token from the beginning in the new start condition. (If you have several possible rules which might match, you will need several start conditions. This could get out of hand, but in most cases the possible set of matching rules is very limited.)

    At some point, you need to return to the previous start condition. (You could use a start condition stack if it's not trivial to figure out which was the previous start condition.) Ideally, to avoid false interior matches, you would want to return to the previous start condition only when you reached the end of the first (incorrect) match. That's easy to do if you are tracking the input position for each token; you just need to save the position just before calling yyless(0). (You also need to correct the token input position by subtracting yyleng before yyless sets yyleng to 0).

    Rescanning from the beginning of the token might seem inefficient, but it's less inefficient than the overhead imposed by REJECT. And the overhead of REJECT affects the entire scanner operation, while the rescanning solution is essentially free except for the tokens you happen to rescan. [Note 2]

    (BEGIN(some_state); yyless(0);) is a reasonably common flex idiom; this is not the only use case. Another one is the answer to the question "How do I run a start condition until I reach a token I can't identify without consuming that token?")

    But I think that in this case there is a simpler solution, using yymore to accumulate the token. (This avoids having to do your own dynamically expanded token buffer.) Again, there are two possibilities, depending on whether you might allow initial prefix matches or restrict the possibilities to either the full sequence or the first identifier.

    But the outline is the same: Match the shortest valid possibility first, remember how long the token was at that point, and then use a start condition to enable a pattern which extends the token, either to the next possibility or to the end of the sequence, depending on your needs. Before continuing with the scanner loop, you call yymore() which indicates to Flex that the next pattern extends the token rather than replacing it.

    At each possible match end, you test to see if that would really be a valid token and if so, record the position (and whatever else you might need to recall, such as the token type). When you reach a point where you can no longer extend the match, you use yyless to fall back to the last valid match point, and return the token.

    This is slightly less inefficient than the pure yyless() solution because it avoids the rescan of the token which is returned. (All these solutions, including REJECT, do rescan the text following the selected match if it is shorter than the longest possible extended match. That's theoretically avoidable but since it's not a lot of overhead, it doesn't seem worthwhile to build a complex mechanism to avoid it.) Again, you probably want to avoid trying to extend token matches after the fallback until you reach the longest extent. This can be solved the same way, by recording the longest matched extent, but the start condition handling is a bit simpler.

    Here's some not-very-well tested code for the simpler problem, where only the first identifier and the full match are possible:

    %{
        /* Code to track current token position and the fallback position.
         * It's a simple byte count; line-ends are not dealt with.
         */
        static int token_pos = 0;
        static int token_max_pos = 0;
        /* This is done before every action, even empty actions */
        #define YY_USER_ACTION token_pos += yyleng;
        /* FALLBACK needs to undo YY_USER_ACTION */
        #define FALLBACK(to) do { token_pos -= yyleng - to; yyless(to); } while (0)
        /* SET_MORE needs to pre-undo the next YY_USER_ACTION */
        #define SET_MORE(X) do { token_pos -= yyleng; yymore(); } while(0)
    %}
    
    %x EXTEND_ID
    ident [[:alpha:]_][[:alnum:]_]*
    %%
        int fallback_leng = 0;
        /* The trailing context here is to avoid triggering EOF in
         * the EXTEND_ID state.
         */
    {ident}/[^[:alnum:]_] {
                  /* In the fallback region, don't attempt to extend the match. */
                  if (token_pos <= token_max_pos)
                    return Y_IDENT;
                  fallback_leng = yyleng;
                  BEGIN(EXTEND_ID);
                  SET_MORE(yymore);
                }
    {ident}     { return find_type(yytext) ? Y_TYPE_NAME : Y_IDENT; }
    <EXTEND_ID>{
      ([[:space:]]*"::"[[:space:]]*{ident})*|.|\n {
                  BEGIN(INITIAL);
                  if (yyleng == fallback_leng + 1)
                    FALLBACK(fallback_leng);
                  if (find_type(yytext))
                    return Y_TYPE_NAME;
                  else {
                    FALLBACK(fallback_leng);
                    return Y_IDENT;
                  }
                }
    }
    

    Notes

    1. At least, I think you can do that. I haven't ever tried and REJECT does impose a number of limitations on other scanner features. Really, it's a historical artefact which massively slows down lexical analysis and should generally be avoided.

    2. The use of REJECT anywhere in the rules causes Flex to switch to a different template in which a list of endpoints is retained, which has a cost in both time and space. Another consequence is that Flex can no longer resize the input buffer.