flex -l longest pattern match strategy - not here?

I have two lex rules and was wondering why the second rule never matches. Instead, rule 1 always fires on the input 2005-05-09-11.23.04.790000

<data>[-]?[0-9]*[.][0-9]*  {
                comma = 0;
                printf("DEBUG: data 1 %s\n", yytext);
                strcat(data_line, yytext);
            }
<data>[0-9]{4}[-][01][0-9][-][0-3][0-9][-][0-9]{2}[.][0-9]{2}[.][0-9]{2}[.][0-9]{6} {
                printf("DEBUG: data 2[%s]\n", yytext);
                /* 1996-07-15-hh.00.00*/
            }

I thought flex/lex would follow the longest-match rule?

Interestingly, flex (without the -l lex compatibility flag) behaves "right", or at least the way I want it to behave.

Solution

This is one of several "gotchas" related to Posix/lex compatibility [Note 1]. For historical reasons, the (Posix-standard) lex regular expression dialect differs from (Posix-standard) EREs ("extended regular expressions"), even though Posix uses the same abbreviation to describe the lex dialect.

The difference is the precedence of the brace-repetition operator. In standard EREs, and pretty well every other regular expression variety I know of, abc{3} would match abccc. And that's how it is interpreted by flex, too, unless you specify the -l or --posix flags. If you request lex-compatibility, the precedence of the brace operator becomes lower than that of concatenation, so abc{3} matches abcabcabc.
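The two readings can be tried out with the C library's POSIX regcomp(), which uses standard ERE precedence. This is only a sketch: it does not run flex or lex, and full_match is a hypothetical helper name. The lex-mode reading of abc{3} is emulated by writing the parentheses out explicitly as (abc){3}.

```c
#include <regex.h>
#include <stdio.h>

/* Return 1 if `pattern`, compiled as a POSIX ERE, matches all of `text`;
   return 0 otherwise (including when the pattern fails to compile).
   full_match is an illustrative helper, not part of flex or lex. */
int full_match(const char *pattern, const char *text)
{
    char anchored[512];
    regex_t re;
    int n, ok;

    /* Anchor the pattern so that only whole-string matches count. */
    n = snprintf(anchored, sizeof anchored, "^(%s)$", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return 0;
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

With this helper, full_match("abc{3}", "abccc") succeeds and full_match("abc{3}", "abcabcabc") fails, which is the ERE (and default flex) reading. full_match("(abc){3}", "abcabcabc") succeeds, which is what abc{3} means once the brace operator binds more loosely than concatenation.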

If you want to write regexes which will work with either regex variety, you must parenthesize all (or almost all) uses of the repetition operator. So your second pattern would need to be written as:

[0-9]{4}[-][01][0-9][-][0-3][0-9][-]([0-9]{2})[.]([0-9]{2})[.]([0-9]{2})[.]([0-9]{6})

Without the parentheses, the original pattern won't match the specified input in lex-compatibility mode, while the first rule can still match fragments of it that contain a period, such as 11.23.
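As a rough check outside of flex, the sketch below uses POSIX regcomp() to measure how much of the input each pattern matches when anchored at the current position, which is what decides the winner under the longest-match rule. The names RULE1, RULE2 and prefix_match_length are made up for this illustration, and regcomp() only models the standard ERE reading, not lex mode.

```c
#include <regex.h>
#include <stdio.h>

/* Rule 1 from the question, and rule 2 in its parenthesized form. */
const char RULE1[] = "[-]?[0-9]*[.][0-9]*";
const char RULE2[] =
    "[0-9]{4}[-][01][0-9][-][0-3][0-9][-]"
    "([0-9]{2})[.]([0-9]{2})[.]([0-9]{2})[.]([0-9]{6})";

/* Length of the match of `pattern` anchored at the start of `text`,
   or -1 if it does not match there or cannot be compiled.
   Hypothetical helper: it mimics what a scanner asks at one position. */
int prefix_match_length(const char *pattern, const char *text)
{
    char anchored[512];
    regex_t re;
    regmatch_t m[1];
    int n, len = -1;

    n = snprintf(anchored, sizeof anchored, "^(%s)", pattern);
    if (n < 0 || (size_t)n >= sizeof anchored)
        return -1;
    if (regcomp(&re, anchored, REG_EXTENDED) != 0)
        return -1;
    if (regexec(&re, text, 1, m, 0) == 0)
        len = (int)(m[0].rm_eo - m[0].rm_so);
    regfree(&re);
    return len;
}
```

Under these assumptions, prefix_match_length(RULE2, "2005-05-09-11.23.04.790000") returns 26, the whole timestamp, while prefix_match_length(RULE1, "-11.23.04.790000") returns 6, covering only -11.23.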

For what it's worth, the other postfix repetition operators -- +, * and ? -- have the normal high precedence in lex mode. (In a way, this inconsistency makes the behaviour of brace-repetition even more confusing.)

Another gotcha with braces in lex mode is that when they are used for macro replacement, no implicit parentheses are added around the replacement text. So in flex:

foo     [fF][oO][oO]
%%
{foo}+  {
          /* yytext is some number of case-insensitive repetitions of foo */
        }

whereas in lex-compatibility mode:

foo     [fF][oO][oO]
%%
{foo}+  {
          /* yytext is an 'f' or 'F' followed by at least two 'o' or 'O's */
        }
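The effect of the missing parentheses can be emulated by doing the substitution by hand and handing the result to POSIX regcomp(). This is a sketch of the textual expansion only, not of flex internals, and macro_rule_matches is a hypothetical helper.

```c
#include <regex.h>
#include <stdio.h>

/* Substitute a definition body in front of `suffix`, either wrapped in
   implicit parentheses (default flex behaviour) or pasted in as-is
   (lex-compatibility behaviour), then report whether the resulting ERE
   matches all of `text`. Returns 1 on a match and 0 otherwise. */
int macro_rule_matches(const char *body, const char *suffix,
                       int implicit_parens, const char *text)
{
    char pattern[512];
    regex_t re;
    int n, ok;

    if (implicit_parens)
        n = snprintf(pattern, sizeof pattern, "^((%s)%s)$", body, suffix);
    else
        n = snprintf(pattern, sizeof pattern, "^(%s%s)$", body, suffix);
    if (n < 0 || (size_t)n >= sizeof pattern)
        return 0;
    if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
        return 0;
    ok = (regexec(&re, text, 0, NULL, 0) == 0);
    regfree(&re);
    return ok;
}
```

Calling macro_rule_matches("[fF][oO][oO]", "+", 1, "fooFOO") succeeds because the + applies to the whole parenthesized definition. With implicit_parens set to 0, the same call fails and "fooo" matches instead, because the + then applies only to the final [oO].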

Notes:

1. The -l (and --posix) flags are options I recommend avoiding. Only use them when absolutely necessary to compile legacy code developed to the lex standard.