Tags: c++, flex-lexer, lex, lexical-analysis

flex filename.l won't produce lex.yy.c


I want to detect whether any uppercase letters are duplicated in the input. The program should work fine, but when I run the command flex filename.l in the Windows command prompt, it shows no errors or warnings. The command just runs, and I keep waiting for it to output lex.yy.c, but that never happens; I have now been waiting for about an hour. Why does this happen? This is the part of the code for detecting duplicates.

// Definition Section
A_ ([B-L]*"A"[B-L]*("A"[B-L]*)+)
B_ ([ACDEFGHIJKL]*"B"[ACDEFGHIJKL]*("B"[ACDEFGHIJKL]*)+)
C_ ([ABDEFGHIJKL]*"C"[ABDEFGHIJKL]*("C"[ABDEFGHIJKL]*)+)
D_ ([ABCEFGHIJKL]*"D"[ABCEFGHIJKL]*("D"[ABCEFGHIJKL]*)+)
E_ ([ABCDFGHIJKL]*"E"[ABCDFGHIJKL]*("E"[ABCDFGHIJKL]*)+)
F_ ([ABCDEGHIJKL]*"F"[ABCDEGHIJKL]*("F"[ABCDEGHIJKL]*)+)
G_ ([ABCDEFHIJKL]*"G"[ABCDEFHIJKL]*("G"[ABCDEFHIJKL]*)+)
H_ ([ABCDEFGIJKL]*"H"[ABCDEFGIJKL]*("H"[ABCDEFGIJKL]*)+)
I_ ([ABCDEFGHJKL]*"I"[ABCDEFGHJKL]*("I"[ABCDEFGHJKL]*)+)
J_ ([ABCDEFGHIKL]*"J"[ABCDEFGHIKL]*("J"[ABCDEFGHIKL]*)+)
K_ ([ABCDEFGHIJL]*"K"[ABCDEFGHIJL]*("K"[ABCDEFGHIJL]*)+)
L_ ([ABCDEFGHIJK]*"L"[ABCDEFGHIJK]*("L"[ABCDEFGHIJK]*)+)
%%//Rule Section
{A_} {printf("Letter 'A' appeared more than once!");}
{B_} {printf("Letter 'B' appeared more than once!");}
{C_} {printf("Letter 'C' appeared more than once!");}
{D_} {printf("Letter 'D' appeared more than once!");}
{E_} {printf("Letter 'E' appeared more than once!");}
{F_} {printf("Letter 'F' appeared more than once!");}
{G_} {printf("Letter 'G' appeared more than once!");}
{H_} {printf("Letter 'H' appeared more than once!");}
{I_} {printf("Letter 'I' appeared more than once!");}
{J_} {printf("Letter 'J' appeared more than once!");}
{K_} {printf("Letter 'K' appeared more than once!");}
{L_} {printf("Letter 'L' appeared more than once!");}

Solution

  • There are really two separate issues here. One is the question you asked (why does flex take so long to compile this scanner?) and the other is whether the scanner description accurately reflects your intent. The second question is a bit tricky because you don't give a very precise description of what you intended, but I'll try to consider some plausible possibilities at the end.

    To start with, it's useful to describe what your scanner description actually does. That's also a bit tricky, because you have only included part of your scanner, so I'm going to take the liberty of projecting a complete scanner which at least illustrates what the intention might have been. The code I'm going to work with is the following:

    Note: The options prevent compiler warnings, remove the need for yywrap, and warn if the rules don't cover all possible inputs. The fallback patterns at the end and the brief definition of main make the scanner runnable. {-} is a Flex extension which computes the set difference between two character classes.

      /* My standard options
       */
    %option noinput nounput noyywrap nodefault
    NOT_A  [A-L]{-}[A]
    NOT_B  [A-L]{-}[B]
    NOT_C  [A-L]{-}[C]
    NOT_D  [A-L]{-}[D]
    NOT_E  [A-L]{-}[E]
    NOT_F  [A-L]{-}[F]
    NOT_G  [A-L]{-}[G]
    NOT_H  [A-L]{-}[H]
    NOT_I  [A-L]{-}[I]
    NOT_J  [A-L]{-}[J]
    NOT_K  [A-L]{-}[K]
    NOT_L  [A-L]{-}[L]
    %%
    {NOT_A}*A{NOT_A}*(A{NOT_A}*)+   { puts("Duplicate A"); }
    {NOT_B}*B{NOT_B}*(B{NOT_B}*)+   { puts("Duplicate B"); }
    {NOT_C}*C{NOT_C}*(C{NOT_C}*)+   { puts("Duplicate C"); }
    {NOT_D}*D{NOT_D}*(D{NOT_D}*)+   { puts("Duplicate D"); }
    {NOT_E}*E{NOT_E}*(E{NOT_E}*)+   { puts("Duplicate E"); }
    {NOT_F}*F{NOT_F}*(F{NOT_F}*)+   { puts("Duplicate F"); }
    {NOT_G}*G{NOT_G}*(G{NOT_G}*)+   { puts("Duplicate G"); }
    {NOT_H}*H{NOT_H}*(H{NOT_H}*)+   { puts("Duplicate H"); }
    {NOT_I}*I{NOT_I}*(I{NOT_I}*)+   { puts("Duplicate I"); }
    {NOT_J}*J{NOT_J}*(J{NOT_J}*)+   { puts("Duplicate J"); }
    {NOT_K}*K{NOT_K}*(K{NOT_K}*)+   { puts("Duplicate K"); }
    {NOT_L}*L{NOT_L}*(L{NOT_L}*)+   { puts("Duplicate L"); }
      /* Match any string consisting of A-L */
    [A-L]+                          { puts("No duplicate"); }
      /* Ignore newline or any other character */
    .|\n                            ;
    %%
    int main(void) {
      return yylex();
    }
    

    It's important to underline the fact that the point of a scanner is to divide the input into tokens. A scanner is not a general purpose regular expression engine, although it is sometimes possible to push the boundaries a bit. However, if you want to do a task like searching the input for possibly overlapping regular expression matches, ignoring unmatched input, you might find that Flex's sequential match architecture is not a good match for the problem.

    So, in this case, I assumed that the strings of interest are sequences of upper case letters (with a restricted alphabet) and that anything else can be safely ignored. (That means that the input abcABCD993 will be divided into seven tokens, of which six are ignored: the lower case letters at the beginning and the digits at the end. ABCD is considered a token even though it is not visibly delimited from the surrounding text. Fixing that, if it needs to be fixed, is not difficult, but is not relevant to this question.)

    Because I added the pattern [A-L]+ at the end, the rule set does match any sequence of upper case letters from the restricted alphabet. But only one rule is allowed to match for any such word. If there is a duplicate letter, one of the first 12 rules will match. The [A-L]+ rule will also match, but since it is at the end, it will only be applied if no duplicate rule matches. Similarly, a token could have more than one duplicate letter (for example, CCLAAL matches the A, C and L duplicate rules). Because any rule which matches will match the entire token, the rule which applies is the first rule in the scanner description; the string CCLAAL will trigger the report Duplicate A.

    Innocuous though it seems, it is that resolution which causes the exponential blow-up in scanner size. Consider the example input CCLAAL once again. The CC at the beginning immediately puts the pattern {NOT_C}*C{NOT_C}*(C{NOT_C}*)+ into an accepting state, but the match does not terminate at that point; it can be extended. However, while the scanner continues to see whether there is a better match, it can't forget that Duplicate C is a possibility. Had the input been CCLAL, for example, Duplicate C would be the best match rather than Duplicate A. And it's not just full duplicate matches which need to be remembered. With the input ACLLCA, it's not known until the second A is reached that Duplicate A is correct. So at the point that the second L has been read, the scanner must remember not only that Duplicate L is a possible report, but also that a single additional A or C could change the report accordingly.

    If we were going to write a program to implement these semantics, the simplest way would be to keep a vector of integers representing the number of occurrences of each of the letters A through L. Such code might look something like this:

    int check_for_duplicate(void) {
      int ch;
      int seen['L' - 'A' + 1] = {0};
      while ((ch = getchar()) != EOF && (ch < 'A' || ch > 'L'))
        continue;
      if (ch == EOF) return EOF;
      seen[ch - 'A'] = 1;
      while ( (ch = getchar()) != EOF && ch >= 'A' && ch <= 'L') {
        ++seen[ch - 'A'];
      }
      /* Put the following character back into the stream */
      if (ch != EOF) ungetc(ch, stdin);
      /* See if a duplicate was found */
      for (int i = 0; i < sizeof(seen) / sizeof(*seen); ++i) {
        if (seen[i] >= 2) {
          printf("Duplicate %c\n", 'A' + i);
          return 1;
        }
      }
      puts("No duplicates");
      return 0;
    }
    

    In that program, the array seen maintains the state of the scan. Only three of the possible count values actually matter for each letter (zero, one, or more than one), so as a first approximation there are a total of 3¹² = 531,441 possible meaningful scanner states. That doesn't matter here, but it does matter to Flex; the scanner built by Flex is a state machine with no data other than the scanner state.

    That seems manageable, and even more so when we note that anything after the first 2 is irrelevant, reducing the count to 3·2¹² − 1, or 12,287. Unfortunately, Flex is actually dealing with regular expressions, and the reduction to duplicate counts is an abstraction which it can't make. We can see that there are only three relevant counts for each letter, but Flex is limited to noting how far it has progressed in the regular expression for each possible duplicate match. And there are more than three positions in each regular expression.

    If you run Flex with the -v option, it will print out a summary of statistics, one of which is the number of "DFA states" which it produced from the ruleset. The scanner Flex generates includes a state transition table whose size is related to the number of states, so that statistic is very important. I ran Flex with different numbers of duplicate match rules, from 2 to 9, and extracted the number of states in each state machine; as might be expected from the above, the state count is exponential in the number of rules:

    Rules   States
      2       28
      3       89
      4      306
      5     1063
      6     3656
      7    12405
      8    41566
      9   137795
    

    Based on that progression, it's reasonable to predict that the scanner for 12 letters would have somewhat more than five million states, and the source file which contains the state transitions would be measured in gigabytes. So it's not too surprising that Flex takes a while to generate it.

    What is to be done?

    Personally, I'd write this check in C myself rather than trying to express it as a Flex pattern. You could use Flex to identify the token which you want to scan for duplicates and then scan it with a simple C function. You could make the scan faster by stopping at the first duplicate character in the token rather than attempting to find the duplicate which comes first alphabetically.

    If you really want to do this analysis in Flex, you could do it with multiple scans. Each alternative would have its own start condition; you can try the next alternative by calling yyless(0) to return to the start of the token and then BEGIN(next_sc) to switch to it. But doing 12 (or 26) scans is pretty time-consuming, particularly in the case where the token does not contain any duplicates.
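    A two-letter fragment of that multi-scan structure might look like the following. This is an illustrative sketch only; I have not run it, and the start-condition name is mine. Because CHECK_B is declared with %x (exclusive), the unprefixed rules are inactive while it is in effect:

```
%option noinput nounput noyywrap nodefault
%x CHECK_B
NOT_A  [A-L]{-}[A]
NOT_B  [A-L]{-}[B]
%%
{NOT_A}*A{NOT_A}*(A{NOT_A}*)+            { puts("Duplicate A"); }
  /* No duplicate A: rescan the same token for a duplicate B */
[A-L]+                                   { yyless(0); BEGIN(CHECK_B); }
<CHECK_B>{NOT_B}*B{NOT_B}*(B{NOT_B}*)+   { BEGIN(INITIAL); puts("Duplicate B"); }
  /* ...in the full version, continue through CHECK_C .. CHECK_L... */
<CHECK_B>[A-L]+                          { BEGIN(INITIAL); puts("No duplicate"); }
.|\n                                     ;
%%
int main(void) { return yylex(); }
```

    The full version would chain through one start condition per letter, each calling yyless(0) before moving to the next, which is why a duplicate-free token costs 12 complete rescans.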