Search code examples
tokenizeflex-lexerlexical-analysis

Order of precedence for token matching in Flex


My apologies if the title of this thread is a little confusing. What I'm asking about is how does Flex (the lexical analyzer) handle issues of precedence?

For example, let's say I have two tokens with similar regular expressions, written in the following order:

"//"[!\/]{1}    return FIRST;
"//"[!\/]{1}\<  return SECOND;

Given the input "//!<", will FIRST or SECOND be returned? Or both?

The FIRST string would be reached before the SECOND string, but it seems that returning SECOND would be the right behavior.


Solution

  • The longest match is returned.

    From flex & bison, Text Processing Tools:

    How Flex Handles Ambiguous Patterns

    Most flex programs are quite ambiguous, with multiple patterns that can match the same input. Flex resolves the ambiguity with two simple rules:

    • Match the longest possible string every time the scanner matches input.
    • In the case of a tie, use the pattern that appears first in the program.

    You can test this yourself, of course:

    file: demo.l

    %%
    "//"[!/]   {printf("FIRST");}
    "//"[!/]<  {printf("SECOND");}
    %%
    
    int main(int argc, char **argv)
    {
        while(yylex() != 0);
        return 0;
    }
    

    Note that / and < don't need escaping, and {1} is redundant.

    bart@hades:~/Programming/GNU-Flex-Bison/demo$ flex demo.l 
    bart@hades:~/Programming/GNU-Flex-Bison/demo$ cc lex.yy.c  -lfl
    bart@hades:~/Programming/GNU-Flex-Bison/demo$ ./a.out < in.txt 
    SECOND
    

    where in.txt contains //!<.