Flex lexical analyzer not behaving as expected


I'm trying to use Flex to match basic patterns and print something.

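%{
#include <stdio.h>
%}

%%

^[^qA-Z]*q[a-pr-z0-9]*4\n           {printf("userid1, userid2  \n"); return 1;}

%%
int yywrap(void){return 1;}

int main( int argc, char **argv )
{
    ++argv, --argc;  /* skip over program name */
    if ( argc > 0 )
        yyin = fopen( argv[0], "r" );
    else
        yyin = stdin;

    if ( yyin == NULL ) {  /* the input file could not be opened */
        perror( "fopen" );
        return 1;
    }

    while (yylex());  /* keep scanning until yylex() returns 0 at end of input */
    return 0;
}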

Resolved dumb question


Solution

  • I don't know what you are trying to do, so I'll focus on the immediate issue, which is your last pattern:

    ^[^qA-Z]*q[a-pr-z0-9]*4[a-pr-z0-9]*4[a-pr-z0-9]*\n
    

    That pattern starts by matching [^qA-Z]*, which is any number of characters, each of which is neither a q nor a capital letter (A-Z). Then it matches a q.

    Here it's worth considering all the characters which are neither a q nor a capital letter (A-Z). Obviously, that includes lower-case letters other than q, such as s. It also includes digits. And it includes any other character: punctuation, whitespace, even control characters. In particular, it includes a newline character.

    So when you type

    10s10<newline>
    

    That certainly could be the start of the last pattern. The scanner hasn't yet seen a q so it doesn't know whether the pattern will eventually match, but it hasn't yet failed. So it keeps on reading more characters, including more newlines.

    When you eventually type a q, the scanner can continue with the rest of the pattern. Depending on what you type next, it might or might not be able to continue. If, as seems likely, your input eventually fails to match the pattern, the lexer will fall back to the longest successful match, which is the first pattern. At that point, it will perform the first action.

    Negative character classes need to be used with a bit of caution. It's easy to fall into the trap of thinking that "not ..." only includes "reasonable" input. But it includes everything. Often, as in this case, you'll want to at least exclude newlines, for example by writing [^qA-Z\n] instead of [^qA-Z].
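
The effect of a negated class swallowing newlines can be checked outside Flex. The following is a minimal sketch that uses POSIX regcomp/regexec rather than Flex itself, so it only illustrates the character-class behaviour, not Flex's scanning loop. The helper name match_length is hypothetical, and only the opening part of the pattern (up to the q) is tested. By default (without REG_NEWLINE), a POSIX negated bracket expression also matches a newline, which mirrors what the Flex pattern does.

```c
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the length of the first match of `pattern`
 * in `input`, or -1 if there is no match or the pattern does not compile.
 * Uses POSIX extended regular expressions, not Flex.
 */
long match_length(const char *pattern, const char *input)
{
    regex_t re;
    regmatch_t m[1];
    long len = -1;

    /* No REG_NEWLINE flag: newline is an ordinary character, so a
       negated bracket expression such as [^qA-Z] can match it. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, input, 1, m, 0) == 0)
        len = (long)(m[0].rm_eo - m[0].rm_so);

    regfree(&re);
    return len;
}

/* Opening part of the pattern as written: the class also accepts '\n'. */
const char *GREEDY_PREFIX = "^[^qA-Z]*q";

/* Same prefix with newline excluded. The C compiler turns \n into a real
   newline character inside the bracket expression. */
const char *LINE_PREFIX = "^[^qA-Z\n]*q";
```

With the input "10s10\nabq", match_length(GREEDY_PREFIX, ...) should report a match that runs across the newline to the q on the second line, while match_length(LINE_PREFIX, ...) should report no match, because the class stops at the newline and the ^ anchor only applies at the start of the input here.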