Search code examples
c++cregexflex-lexerlexical-analysis

Flex/Lex: Input string that does not contain duplicate letters [A-Z]


Having read similar questions around the topic such as Flex/Lex: Regular Expression matches double characters and Is there a difference between [ABCDEFGH] and [A-H] on flex? I can't figure out a way to match strings that contain duplicate letters [A-Z] in a {2,8}-length string.

My code will freeze compiling as I run lex my_pgrogram.l in Windows 11 (WSL 2.0) and in Mac OS it will return unrecognized rule in lines 19, 21, 23, 24, and 26:

    /* <IMPORTS> */
    %{
    %}
    /* </IMPORTS> */
    
    /* <RULES AND TOKENS> */
    point (?i:point)
    line (?i:line)
    triangle (?i:triangle)
    square (?i:square)
    pentagon (?i:pentagon)
    hexagon (?i:pehagon)
    heptagon (?i:heptagon)
    octagon (?i:octagon)
    /* </RULES AND TOKENS> */
    
    /* <RULES> */
    %%
    ^[ \t]*({point}|{line}|{triangle}|{square}|{pentagon}|{hexagon}|{heptagon}|{octagon})[ \t]*$ printf("You forgot to enter the geometric entity's name e.g. triangle ABC. Please try again.\n\n");
    ^[ \t]*([A-Z]{1,8})[ \t]*$ printf("You forgot to enter a geomteric entity e.g. triangle. Please try again.\n\n");
    {point}|{line}|{triangle}|{square}|{pentagon}|{hexagon}|{heptagon}|{octagon} printf("--> %s: is a geometric entity.\n", yytext);  /* geometric entity is recognised! */
    
    ^[ \t]*({point}|{line}|{triangle}|{square}|{pentagon}|{hexagon}|{heptagon}|{octagon})[ \t]*[A-Z]*([a-z])+[A-Z]*([a-z])*[ \t]*$ printf("Only Capital letters allowed for geometric entity's name. Please try again with characters from A to Z.\n\n");
    {point}[ \t]*[A-Z]{2,}|{line}[ \t]*[A-Z]{1}|{line}[ \t]*[A-Z]{3,}|{triangle}[ \t]*[A-Z]{1,2}|{triangle}[ \t]*[A-Z]{4,}|{square}[ \t]*[A-Z]{1,3}|{square}[ \t]*[A-Z]{5,}|{pentagon}[ \t]*[A-Z]{1,4}\n|{pentagon}[ \t]*[A-Z]{6,}|{hexagon}[ \t]*[A-Z]{1,5}|{hexagon}[ \t]*[A-Z]{7,}|{heptagon}[ \t]*[A-Z]{1,6}|{heptagon}[ \t]*[A-Z]{8,}|{octagon}[ \t]*[A-Z]{1,7}|{octagon}[ \t]*[A-Z]{9,} printf("Invalid geometric entity's name. Please try again.\n\n");
    
    [A-Z]*A[A-Z]*A[A-Z]*|[A-Z]*B[A-Z]*B[A-Z]*|[A-Z]*C[A-Z]*C[A-Z]*|[A-Z]*D[A-Z]*D[A-Z]*|[A-Z]*E[A-Z]*E[A-Z]*|[A-Z]*F[A-Z]*F[A-Z]* printf("Oops! You entered some letter twice. Please try again: %s\n", yytext);
    
    [A-Z]{1,8} printf("--> %s: is a geometric entity's name.\n\n", yytext);   /* geometric entity's name is recognised! */
    
    [ \t\n]+
    .+
    %%
    /* </RULES> */
    
    int main(){
        yylex();
    }

Solution

  • Regular expressions -- at least, the regular expressions implemented by (f)lex, which don't implement captures -- are really not the tool you want to use to answer the question about whether a token contains duplicated characters.

    Up to a certain point you can do it, because it's easy to write a regular expression which checks that a particular character hasn't been duplicated, and you can just combine checks for each character in your alphabet. If you don't have too many different characters, that will work OK. But it doesn't scale, because the size of the resulting state machine is exponential in the number of different letters. (And also because flex, for whatever reasons, doesn't use the standard exponential reallocation algorithm when expanding its internal tables, so once the tables get large, all flex is doing is copying the them over and over again in order to slightly increase their length.)

    Doing this test for 26 letters is wholly unrealistic; I estimate that it would require something like 16 thousand million states, but it would take basically forever to find out. My patience ran out at A-O, which took 20 minutes; I guess that A-P would take two hours, and so on. Anyway, I probably don't have enough RAM to do the full alphabet even if I were willing to wait for completion (or modify the code to implement a more sensible reallocation strategy.)

    I note in passing that the final state machine could be easily minimised to fit in RAM. But (f)lex doesn't do state minimisation, and anyway waiting for the subset construction to complete defeats the purpose; you'd need a very different algorithm to do the NFA->DFA conversion.

    So you need to find a more feasible strategy; your best bet is to just scan the token once flex has identified it, and see if it includes duplicates. (You could also make sure that it's the right length.)

    If you were fixated on efficiency, you could do better than the following, but the minor inefficiency of initializing 256 booleans instead of 26 isn't really going to be noticeable in a student exercise, so I'd suggest something like the following simple code:

    /* Returns true if the token is OK (all unique chars)
     * Otherwise false.
     */
    bool all_unique(const char* token) {
      char seen[256] = {0};
      for (const char* p = token; *p; ++p) {
        /* Even if you don't expect eight-bit characters, you
         * don't want to blow up if you get one.
         */
        unsigned index = (unsigned char)*p;
        if (seen[index]) return false; /* Found a dupe */
        seen[index] = true;
      }
      return true; /* No duplicates. Could be other problems. */
    }