Lex program won't recognize word when using OR statement

I am running the following lex program which works fine to recognize the sentence about the cat:

%{
        #include <iostream>
        #include <cstdio>
        #include <cstdlib>
        using namespace std;
        extern "C" int yylex();
%}

SP      [ ]+
ARTICLE "le" /* Line I am trying to change */
COMMUN "chat"
VERBE "est"
NOIR "noir"

PHRASE {ARTICLE}{SP}{COMMUN}{SP}{VERBE}{SP}{NOIR}


%%

^{PHRASE}\n     { cout << "Une phrase : " << yytext << '\n'; }

\n              { cout << '\n'; }

^.*\n           { cout << "Ligne inconnue : " << yytext << '\n'; }

%%

int main(int argc, char *argv[])
{
        ++argv, --argc;  
        if(argc > 0)
                yyin = fopen(argv[0], "r");
        else
                yyin = stdin;

        yylex();
} /* main() */

I get the following output:

Ligne inconnue : le professeur est Jean

Ligne inconnue : le professeur a un ordinateur

Ligne inconnue : Jean aime Linux

**Une phrase : le chat est noir**

Ligne inconnue : les etudiants ont des ordinateurs

But, when I try to add an OR statement to the program (for the ARTICLE), the cat sentence is no longer recognized:

%{
        #include <iostream>
        #include <cstdio>
        #include <cstdlib>
        using namespace std;
        extern "C" int yylex();
%}

SP      [ ]+
ARTICLE "le"|"la" /* Line I am trying to change */
COMMUN "chat" 
VERBE "est"
NOIR "noir"

PHRASE {ARTICLE}{SP}{COMMUN}{SP}{VERBE}{SP}{NOIR}


%%

^{PHRASE}\n     { cout << "Une phrase : " << yytext << '\n'; }

\n              { cout << '\n'; }

^.*\n           { cout << "Ligne inconnue : " << yytext << '\n'; }

%%

int main(int argc, char *argv[])
{
        ++argv, --argc;  
        if(argc > 0)
                yyin = fopen(argv[0], "r");
        else
                yyin = stdin;

        yylex();
}

This gives me the following output:

Ligne inconnue : le professeur est Jean

Ligne inconnue : le professeur a un ordinateur

Ligne inconnue : Jean aime Linux

**Ligne inconnue : le chat est noir**

Ligne inconnue : les etudiants ont des ordinateurs

The input file is just a text file with the following lines:

le professeur est Jean

le professeur a un ordinateur

Jean aime Linux

le chat est noir

les etudiants ont des ordinateurs

Can anyone tell me why this won't work? I have tried every variation of the OR statement I can find online and still nothing works.

Thanks!

Solution

The flex -l flag exists so that really old lex specifications, which wouldn't otherwise work, can still be processed. For any newly written scanner, you really don't want that flag, and this particular issue is a common reason why.

The problem comes from the handling of macro expansion: flex does the common-sense thing, which avoids many common errors; lex (and flex -l), however, makes it much easier to shoot yourself in the foot with a macro definition.

Just in case it's not obvious, what lex calls a "definition" is, in reality, a macro. And just like C preprocessor macros, lex macros introduce a number of potential misunderstandings.

I suppose that almost every C programmer who has ever used the preprocessor has stumbled upon this gotcha:

#define SUM(a,b) a+b    // DON'T DO THIS, EVER

Although you might use this macro successfully in certain contexts, you will eventually discover that

int c = SUM(a,b) * 2;

computes a+b*2 rather than the expected (a+b)*2. That's because macro substitution is purely textual substitution; if there are no parentheses in the macro definition, none are generated.
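As a minimal sketch of that gotcha, the following compares the unparenthesized macro with a parenthesized one. The names SUM_UNSAFE, SUM_SAFE, unsafe_times_two and safe_times_two are hypothetical, chosen only for this illustration.

```cpp
// Hypothetical macro names, used only for this illustration.
#define SUM_UNSAFE(a, b) a + b        // pasted in textually, no grouping
#define SUM_SAFE(a, b) ((a) + (b))    // parentheses keep the sum together

// After expansion the body reads: x + y * 2
int unsafe_times_two(int x, int y) { return SUM_UNSAFE(x, y) * 2; }

// After expansion the body reads: ((x) + (y)) * 2
int safe_times_two(int x, int y) { return SUM_SAFE(x, y) * 2; }
```

With x = 3 and y = 4, unsafe_times_two returns 11 (3 + 4*2), while safe_times_two returns 14.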

That's the way lex worked, too, and it's the way the POSIX standard says it's supposed to work. But many years ago, the authors of flex realised that practically no one expects definitions like the following to work the way that they do:

ARTICLE "le"|"la"
%%
{ARTICLE}" chat"  { /* Matches either "le" or "la chat" */ }

So flex (usually) automatically inserts the needed parentheses, as though you had correctly defined ARTICLE as:

ARTICLE ("le"|"la")

However, that's incompatible with the original lex, and it might break old lex programs which depended on the original, annoyingly literal semantics.

So flex provides the -l ("Lex compatibility") option, which can be used to process these old lex programs. But, as I said, it should not be used for any new lex program.
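To see why the sentence from the question stops matching under the literal semantics, here is a rough sketch that writes out the expanded PHRASE pattern by hand and tests it with std::regex. This is only an analogy: std::regex is not lex, the quoting and the ^ and \n anchors are dropped, and the names lex_style, flex_style and matches are hypothetical. What carries over is that alternation has the lowest precedence in both notations.

```cpp
#include <regex>
#include <string>

// Roughly what lex (or flex -l) sees once ARTICLE is pasted into PHRASE:
// the "|" splits the whole pattern into "le" and "la +chat +est +noir".
const std::regex lex_style("le|la +chat +est +noir");

// Roughly what flex sees by default: the definition is parenthesized first.
const std::regex flex_style("(le|la) +chat +est +noir");

// True only if the entire line matches the pattern.
bool matches(const std::regex& re, const std::string& line) {
    return std::regex_match(line, re);
}
```

Here matches(lex_style, "le chat est noir") is false, because the first alternative is just "le", whereas matches(flex_style, "le chat est noir") is true. In the scanner from the question, the catch-all ^.*\n rule then wins because it matches more text than the bare "le".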

And just in case the above is not sufficiently convincing, that's not the only bad choice made by the original lex which is preserved by the -l flag. Another one is the bizarre operator precedence of the counted repetition operator {m,n}. In lex (and in flex -l),

ab*   ab+   ab?   ab{0,3}

mean, respectively:

"an a followed by zero or more bs"
"an a followed by one or more bs"
"an a followed by an optional b"
"from zero to three repetitions of ab"

Flex fixes this inconsistency by making the operator precedence of bracket repetition the same as the operator precedence of any other repetition operator, which is undoubtedly what everyone expects: in flex, ab{0,3} means "an a followed by from zero to three bs". Again, the -l flag reverts to the original lex specification.
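The two readings of ab{0,3} can be compared with a short sketch using std::regex, whose default grammar binds {m,n} the same way as the other repetition operators, so the original lex reading has to be spelled with explicit parentheses. The names lex_reading, flex_reading and full_match are hypothetical, and std::regex is only standing in for the scanner's pattern language.

```cpp
#include <regex>
#include <string>

// Original lex reading of ab{0,3}: the count applies to the whole "ab".
const std::regex lex_reading("(ab){0,3}");

// flex reading of ab{0,3}: the count applies to the "b" only.
const std::regex flex_reading("ab{0,3}");

// True only if the entire text matches the pattern.
bool full_match(const std::regex& re, const std::string& text) {
    return std::regex_match(text, re);
}
```

Under the lex reading "abab" matches and "abbb" does not; under the flex reading it is the other way around.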

Finally, the -l option makes the default yytext declaration an array rather than a pointer. While this can make a few things easier, on the whole it brings some important disadvantages, including:

It's a lot slower.
It prevents the scanner from resizing its buffer to cope with long tokens.

The bottom line: Don't use the flex -l option (nor, while we're on this subject, the bison -y option) unless you need it to compile legacy code.