Search code examples
regexflex-lexerlex

Unrecognized rule in flex - negative lookahead


When I run flex on this code it complains about unrecognized rule. I want to match strings like (b|B)^n(a|A)^m such that n >= 4 and m <= 3. I tested the regex on regex101 and it works fine. Here is the code:

%{
    #include <stdio.h>
    ...
%}
* some rules *
STRING  ([bB]{4,}[aA]{1,3}(?!(a|A)+))+     // rule causing error
%%
...
{STRING} {      
            printf("%s ", yytext);
         }

Edit: Whole strings should be matched and allowed separators are whitespace and\or comma. Substrings should not be matched.


Solution

  • (F)lex does not implement negative lookahead assertions. (Note: preceding link is not an endorsement.)

    You can find a complete list of the regular expression operators accepted by flex in the flex manual; if a syntax is not in that list, it will not be recognized regardless of what online regex services tell you. (Note: preceding link is an endorsement.)

    (F)lex does implement a positive lookahead assertion, but only at the very end of the pattern. This is indicated with the trailing context operator /. You could use that operator to recognize your token by requiring that it be followed by something other than an A:

    [bB]{4,}[aA]{1,3}/[^Aa]  { printf("%s ", yytext); }
    

    But that is not quite the same semantics, because it won't recognise the token at the very end of the input. It requires that the token be followed by something which is not an A; begin followed by nothing does not count. (In practice, this might not make much difference. If you are scanning input from a text stream, you can reasonably expect the stream to have a newline as the last character, and the newline character will match [^Aa]. However, if you intend to scan text strings which might not have newlines at all, then you need to be aware of the issue.)

    Most of the time, this is not really what you want. Or if it really is what you want, then (f)lex is probably not a good match for your use case.

    (F)lex is designed to divide an input into consecutive tokens. It does not search for tokens; it identifies the token at the current input point. It expects that the entire input will consist of tokens, so some pattern needs to match at each point.

    On that basis, you need to think about what kind of token a non-matching sequence is. Take, for example:

    bbbbbbbaaaa
    

    That has too many as to be a "string" with your rule. So what is it?

    1. The valid token bbbbbbbaaa followed by another token starting with a?

    2. A valid token which matches some other pattern? (Eg. LONG_STRING)?

    3. An invalid token which should be ignored, allowing the scan to continue?

    4. An unrecoverable error?

    All of those cases can be handled without using any lookaround operators.

    In the first case, it is sufficient to use a regular expression which matches a valid token:

    [bB]{4,}[aA]{1,3}     { printf("Valid STRING: %s ", yytext); }
    

    In the second case, you can rely on the (f)lex maximum munch matching rule, which states that the pattern corresponding to the longest match will be used:

    [bB]{4,}[aA]{1,3}     { printf("Valid STRING: %s ", yytext); }
    [bB]{4,}[aA]{4,}      { printf("Valid LONG STRING: %s ", yytext); }
    

    That can be simplified:

    [bB]{4,}[aA]{1,3}     { printf("Valid STRING: %s ", yytext); }
    [bB]{4,}[aA]          { printf("Valid LONG STRING: %s ", yytext); }
    

    This will have the same effect because the (f)lex rule to decide between two patterns which both have the longest match is to use the first of the patterns in the input file. So the first six characters of bbbbaa--- matches both patterns and thus the first one wins, while bbbbaaaa--- has a seven-character match with the first pattern and an eight-character match with the second one, so the second one wins.

    For the third and fourth cases, the above pair of patterns can be used as well; the only difference is in the action corresponding to the second pattern. For case 3: ignore the token, possibly issuing a warning; for case 4: produce an error message and terminate the scan.