regex lex cc

Unexpected behaviour in Lex


I have written this code in my bas.l file:

digit [0-9]

%% 
{digit}{1,5} {printf("Small Number");}
^{digit}+$ {printf("Big Number");}
%%


int main(void){
    yylex();
    return 0;
}

I then used the lex and cc commands to build my lexical analyser and produce an executable file.

Exact commands:

lex bas.l
cc lex.yy.c -ll

When I enter any number that is 1 to 5 digits long, the analyser doesn't match it with the first regular expression. It always matches the second one and prints Big Number.

Input: 43
Output: Big Number

Input: 4341
Output: Big Number

Input: 434111
Output: Big Number

I expect my analyser to print Big Number when I input a group of more than 5 digits and Small Number when I input 5 digits or fewer.

I tried removing the ^ and $ from the second regular expression, and then it works. The analyser matches numbers of more than 5 digits with the second regular expression and prints Big Number. It matches numbers of 5 digits or fewer with the first regular expression and prints Small Number.

I want to know why it didn't work in the first case. What do the ^ and $ in the second expression have to do with matching the first expression when the input is simply some three-digit number like 324? Thanks in advance.

P.S.: I do know that ^ matches at the start of a line and $ at the end. Yet I don't see why the first expression is not being matched.


Solution

  • I think the reason the anchored pattern (^...$) always matches is that it also includes the newline, so it is always the "longer" match, as per the Flex manual which says:

    When the generated scanner is run, it analyzes its input looking for strings which match any of its patterns. If it finds more than one match, it takes the one matching the most text (for trailing context rules, this includes the length of the trailing part, even though it will then be returned to the input).

    In (f)lex, the $ is a "trailing context operator" which consumes no characters itself but requires that a newline character follows, so the newline is included in the length of the match. The other trailing context operator is /, so foo$ is the same as foo/\n.
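
To make the length comparison concrete, here is a minimal C sketch that mimics the selection rule for just the two patterns in the question. It is not how flex is implemented (flex compiles the rules into a state machine); the function names small_rule_len, big_rule_len and winning_rule are hypothetical, chosen only for illustration. It assumes the two behaviours described in the Flex manual: the trailing context counts towards the match length, and when two rules match the same length, the rule listed first wins.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Rule 1, {digit}{1,5}: counts at most five leading digits.
   Returns 0 when the rule does not match at all. */
static size_t small_rule_len(const char *s) {
    size_t n = 0;
    while (n < 5 && isdigit((unsigned char)s[n])) {
        n++;
    }
    return n;
}

/* Rule 2: ^{digit}+$ when anchored is non-zero, plain {digit}+ otherwise.
   For the anchored form the returned length includes the newline,
   because trailing context counts when rules are compared. */
static size_t big_rule_len(const char *s, int at_line_start, int anchored) {
    size_t n = 0;
    while (isdigit((unsigned char)s[n])) {
        n++;
    }
    if (n == 0) {
        return 0;               /* {digit}+ needs at least one digit */
    }
    if (!anchored) {
        return n;               /* no trailing context to count */
    }
    if (!at_line_start || s[n] != '\n') {
        return 0;               /* ^ or $ not satisfied */
    }
    return n + 1;               /* digits plus the trailing newline */
}

/* Returns 1 if rule 1 would be chosen, 2 if rule 2 would be chosen,
   and 0 if neither rule matches at this position. */
int winning_rule(const char *input, int at_line_start, int anchored) {
    assert(input != NULL);
    size_t small_len = small_rule_len(input);
    size_t big_len = big_rule_len(input, at_line_start, anchored);
    if (small_len == 0 && big_len == 0) {
        return 0;
    }
    /* Longest match wins; on a tie the rule listed first (rule 1) wins. */
    return (big_len > small_len) ? 2 : 1;
}
```

With the anchors in place, winning_rule("43\n", 1, 1) picks rule 2, because the anchored rule counts 3 characters (two digits plus the newline) against 2 for the first rule. Without the anchors, winning_rule("43\n", 1, 0) picks rule 1, because both rules match 2 characters and the tie goes to the rule listed first, while winning_rule("434111\n", 1, 0) still picks rule 2, since 6 digits beat the 5-digit cap of the first rule.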