Search code examples
regexbisonflex-lexeryacclex

Flex/Bison - My regular expression is not matching occurrences of two or more X's, example XXY-1 or XXY-1


I'm creating a parser for a fictional programming language using flex and bison. There is going to be valid and invalid variable names.

XXXX XY-1 // valid
XXXXX Z // valid
XXX Y // valid
XXX 5Aet // invalid
XXXX XXAB-Y // invalid

The x's at the start are just specifying the size of the variable. The variable 5Aet is invalid because it starts with a number. I have managed to match a regular expression for this

[\_\-0-9][a-zA-Z][a-zA-Z0-9\-\_]* yylval.string = strdup(yytext);return TERM_INVALID_VARIABLE_NAME;

The variable XXAB-Y is invalid because a variable name cannot start with two or more x characters.

I have tried to match a regular expression for this but I have been unsuccessful. I have tried various combinations of the expression below but none of worked. The variable keep getting matched as valid.

[X]{2,}[A-Z0-9\-]* yylval.string = strdup(yytext);return TERM_INVALID_VARIABLE_NAME;

[X]{2,0}[\_\-0-9][a-zA-Z][a-zA-Z0-9\-\_]* yylval.string = strdup(yytext);return TERM_INVALID_VARIABLE_NAME;

lexer.l snippet

[\t ]+ // ignore whitespaces

\n // Ignore new line

[\"][^"]*[\"] yylval.string = strdup(yytext); return TERM_STR;

";" return TERM_SEPARATOR;

"." return TERM_FULLSTOP;

[0-9]+ yylval.integer = atoi(yytext); return TERM_INT;

XX[A-Z0-9-]* yylval.string = strdup(yytext);return TERM_INVALID_VARIABLE_NAME;

[\_\-0-9]+[a-zA-Z][a-zA-Z0-9\-\_]* yylval.string = strdup(yytext);return TERM_INVALID_VARIABLE_NAME;

[A-Z][A-Z0-9\-]* yylval.string = strdup(yytext); return TERM_VARIABLE_NAME;

[X]+ yylval.integer = yyleng; return TERM_SIZE;

. return TERM_INVALID_TOKEN;

parser.y snippet

program:
    /* empty */ | 
    begin middle_declarations body grammar_s end {
        printf("\nParsing complete\n");
        exit(0);
    };

begin:
    TERM_BEGINING TERM_FULLSTOP;

body:
    TERM_BODY TERM_FULLSTOP;

end:
    TERM_END TERM_FULLSTOP;

middle_declarations:
    /* empty */ |
    //Left recursive to allow for many declearations
    middle_declarations declaration TERM_FULLSTOP;

declaration:
    TERM_SIZE TERM_VARIABLE_NAME {
        createVar($1, $2);
    }
    |
    TERM_SIZE TERM_INVALID_VARIABLE_NAME {
        printInvalidVarName($2);
    };

grammar_s:
    /* empty */ |
    grammar_s grammar TERM_FULLSTOP;

grammar:
    add | move | print | input;

add:
    TERM_ADD TERM_INT TERM_TO TERM_VARIABLE_NAME {
        addIntToVar($2, $4);
    }
    |
    TERM_ADD TERM_VARIABLE_NAME TERM_TO TERM_VARIABLE_NAME {
        addVarToVar($2, $4);
    }

    ;

move:
    TERM_MOVE TERM_VARIABLE_NAME TERM_TO TERM_VARIABLE_NAME {
        moveVarToVar($2, $4);
    }
    |
    TERM_MOVE TERM_INT TERM_TO TERM_VARIABLE_NAME {
        moveIntToVar($2, $4);
    }

    ;

print:
    /* empty */ |
    TERM_PRINT rest_of_print {
        printf("\n");
    };

rest_of_print:
    /* empty */ |
    rest_of_print other_print;

other_print:

    TERM_VARIABLE_NAME {
        printVarValue($1);
    }
    |
    TERM_SEPARATOR {
        // do nothing
    }
    |
    TERM_STR {
        printf("%s", $1);
    }

    ;

input:
    // Fullstop declares grammar
    TERM_INPUT other_input;

other_input:

    /* empty */ |
    // Input var1
    TERM_VARIABLE_NAME {
        inputValues($1);
    }
    |
    // Can be input var1; var2;...varN
    other_input TERM_SEPARATOR TERM_VARIABLE_NAME {
        inputValues($2);
    }
    ;

Debug output:

Starting parse
Entering state 0
Reading a token: Next token is token TERM_BEGINING (1.1: )
Shifting token TERM_BEGINING (1.1: )
Entering state 1
Reading a token: Next token is token TERM_FULLSTOP (1.1: )
Shifting token TERM_FULLSTOP (1.1: )
Entering state 4
Reducing stack by rule 3 (line 123):
   $1 = token TERM_BEGINING (1.1: )
   $2 = token TERM_FULLSTOP (1.1: )
-> $$ = nterm begin (1.1: )
Stack now 0
Entering state 3
Reducing stack by rule 6 (line 131):
-> $$ = nterm middle_declarations (1.1: )
Stack now 0 3
Entering state 6
Reading a token: Next token is token TERM_SIZE (1.1: )
Shifting token TERM_SIZE (1.1: )
Entering state 8
Reading a token: Next token is token TERM_VARIABLE_NAME (1.1: )
Shifting token TERM_VARIABLE_NAME (1.1: )
Entering state 13
Reducing stack by rule 8 (line 137):
   $1 = token TERM_SIZE (1.1: )
   $2 = token TERM_VARIABLE_NAME (1.1: )
-> $$ = nterm declaration (1.1: )
Stack now 0 3 6
Entering state 10
Reading a token: Next token is token TERM_FULLSTOP (1.1: )
Shifting token TERM_FULLSTOP (1.1: )
Entering state 15
Reducing stack by rule 7 (line 134):
   $1 = nterm middle_declarations (1.1: )
   $2 = nterm declaration (1.1: )
   $3 = token TERM_FULLSTOP (1.1: )
-> $$ = nterm middle_declarations (1.1: )
Stack now 0 3
Entering state 6
Reading a token: Next token is token TERM_SIZE (1.1: )
Shifting token TERM_SIZE (1.1: )
Entering state 8
Reading a token: Next token is token TERM_VARIABLE_NAME (1.1: )
Shifting token TERM_VARIABLE_NAME (1.1: )
Entering state 13
Reducing stack by rule 8 (line 137):
   $1 = token TERM_SIZE (1.1: )
   $2 = token TERM_VARIABLE_NAME (1.1: )
-> $$ = nterm declaration (1.1: )
Stack now 0 3 6
Entering state 10
Reading a token: Next token is token TERM_FULLSTOP (1.1: )
Shifting token TERM_FULLSTOP (1.1: )
Entering state 15
Reducing stack by rule 7 (line 134):
   $1 = nterm middle_declarations (1.1: )
   $2 = nterm declaration (1.1: )
   $3 = token TERM_FULLSTOP (1.1: )
-> $$ = nterm middle_declarations (1.1: )
Stack now 0 3
Entering state 6
Reading a token: Next token is token TERM_SIZE (1.1: )
Shifting token TERM_SIZE (1.1: )
Entering state 8
Reading a token: Next token is token TERM_VARIABLE_NAME (1.1: )
Shifting token TERM_VARIABLE_NAME (1.1: )
Entering state 13
Reducing stack by rule 8 (line 137):
   $1 = token TERM_SIZE (1.1: )
   $2 = token TERM_VARIABLE_NAME (1.1: )
-> $$ = nterm declaration (1.1: )
Stack now 0 3 6
Entering state 10
Reading a token: Next token is token TERM_FULLSTOP (1.1: )
Shifting token TERM_FULLSTOP (1.1: )
Entering state 15
Reducing stack by rule 7 (line 134):
   $1 = nterm middle_declarations (1.1: )
   $2 = nterm declaration (1.1: )
   $3 = token TERM_FULLSTOP (1.1: )
-> $$ = nterm middle_declarations (1.1: )
Stack now 0 3
Entering state 6
Reading a token: Next token is token TERM_BODY (1.1: )
Shifting token TERM_BODY (1.1: )
Entering state 7
Reading a token: Next token is token TERM_FULLSTOP (1.1: )
Shifting token TERM_FULLSTOP (1.1: )
Entering state 11
Reducing stack by rule 4 (line 126):
   $1 = token TERM_BODY (1.1: )
   $2 = token TERM_FULLSTOP (1.1: )
-> $$ = nterm body (1.1: )
Stack now 0 3 6
Entering state 9
Reducing stack by rule 10 (line 145):
-> $$ = nterm grammar_s (1.1: )
Stack now 0 3 6 9
Entering state 14
Reading a token: Next token is token TERM_PRINT (1.1: )
Shifting token TERM_PRINT (1.1: )
Entering state 20
Reducing stack by rule 22 (line 180):
-> $$ = nterm rest_of_print (1.1: )
Stack now 0 3 6 9 14 20
Entering state 34
Reading a token: Next token is token TERM_STR (1.1: )
Shifting token TERM_STR (1.1: )
Entering state 41
Reducing stack by rule 26 (line 194):
   $1 = token TERM_STR (1.1: )
-> $$ = nterm other_print (1.1: )
Stack now 0 3 6 9 14 20 34
Entering state 44
Reducing stack by rule 23 (line 182):
   $1 = nterm rest_of_print (1.1: )
   $2 = nterm other_print (1.1: )
-> $$ = nterm rest_of_print (1.1: )
Stack now 0 3 6 9 14 20
Entering state 34
Reading a token: Next token is token TERM_FULLSTOP (1.1: )
Reducing stack by rule 21 (line 176):
   $1 = token TERM_PRINT (1.1: )
   $2 = nterm rest_of_print (1.1: )
"hEllo"
-> $$ = nterm print (1.1: )
Stack now 0 3 6 9 14
Entering state 25
Reducing stack by rule 14 (line 150):
   $1 = nterm print (1.1: )
-> $$ = nterm grammar (1.1: )
Stack now 0 3 6 9 14
Entering state 22
Next token is token TERM_FULLSTOP (1.1: )
Shifting token TERM_FULLSTOP (1.1: )
Entering state 35
Reducing stack by rule 11 (line 147):
   $1 = nterm grammar_s (1.1: )
   $2 = nterm grammar (1.1: )
   $3 = token TERM_FULLSTOP (1.1: )
-> $$ = nterm grammar_s (1.1: )
Stack now 0 3 6 9
Entering state 14
Reading a token: Next token is token TERM_END (1.1: )
Shifting token TERM_END (1.1: )
Entering state 16
Reading a token: Next token is token TERM_FULLSTOP (1.1: )
Shifting token TERM_FULLSTOP (1.1: )
Entering state 27
Reducing stack by rule 5 (line 129):
   $1 = token TERM_END (1.1: )
   $2 = token TERM_FULLSTOP (1.1: )
-> $$ = nterm end (1.1: )
Stack now 0 3 6 9 14
Entering state 21
Reducing stack by rule 2 (line 113):
   $1 = nterm begin (1.1: )
   $2 = nterm middle_declarations (1.1: )
   $3 = nterm body (1.1: )
   $4 = nterm grammar_s (1.1: )
   $5 = nterm end (1.1: )

Sample Input:

BeGiNInG.

X XXAB-.
XX XXX7.
XX XXXY.

BoDY.

print "hEllo".

EnD.

Solution

  • [X]{2,}[A-Z0-9\-]* yylval.string = strdup(yytext);return TERM_INVALID_VARIABLE_NAME;
    

    should work just fine, and it does work fine for me. However, it can be simplified to:

    XX[A-Z0-9-]* yylval.string = strdup(yytext);return TERM_INVALID_VARIABLE_NAME;
    

    since any further X characters will match the [A-Z0-9-] character class. (Note that it is not necessary to write \- inside a character class; - will do, provided that it is either the first or last thing in the character class.)

    That pattern (like yours) also matches just XX, but the [X]+ pattern will win since it comes earlier in the flex input file.

    {2,0} is not a valid interval expression, because 0 is less than 2. To specify "2 or more X", write X{2,} (or [X]{2,}, if you prefer. "X"{2,} also works.) That should produce an error message from flex, with the result that no lexical scanner is produced. (However, an old one might still be lying around, which might create confusion.)