Simple Regex pattern unmatched with Flex/Bison (Lex/Yacc)


I have built a trivial compiler with Flex and Bison. It is supposed to recognize a simple string in a source file, and it writes a message to the standard error stream when the string is recognized correctly.

Below is my code and my unexpected result.

This is the source file (testsource.txt) containing the string I am trying to recognize:

\end{document}

This is the Flex file (UnicTextLang.l):

%{
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "y.tab.h"
    void yyerror(char *);
    int yylex(void);
    /* "Connect" with the output file  */
    extern FILE *yyout;
    extern int  yyparse();
%}

%%

^\\end\{document\}$ { yyerror("end matched"); return END; }

    /* skip whitespace */
[ \t] ;

    /* anything else is an error */
. yyerror("invalid character");

%%

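int main(int argc, char *argv[]) {
    if ( argc < 3 ) {
        yyerror("You need 2 args: inputFileName outputFileName");
        return 1;
    }

    yyin = fopen(argv[1], "r");
    yyout = fopen(argv[2], "w");
    /* Stop early if either file could not be opened */
    if ( yyin == NULL || yyout == NULL ) {
        yyerror("Could not open input or output file");
        return 1;
    }

    yyparse();
    fclose(yyin);
    fclose(yyout);

    return 0;
}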

This is the Bison file (UnicTextLang.y):

%{
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "y.tab.h"
    void yyerror(char *);
    int yylex(void);

    /* "Connect" with the output file  */
    extern FILE *yyout;
%}

%token END

%%

document:
        END
        |
        ;

%%

int yywrap(void) {
    return 1;
}

void yyerror(char *s) {
    fprintf(stderr, "%s\n", s); /* Prints to the standard error stream */
}

I run the following commands:

flex UnicTextLang.l
bison -dl -o y.tab.c UnicTextLang.y
gcc lex.yy.c y.tab.c -o UnicTextLang
UnicTextLang.exe testsource.txt output.txt

What I expect to see printed in the console is

end matched

But this is what I get:

invalid character
invalid character
invalid character
invalid character
invalid character
invalid character
invalid character
invalid character
invalid character
invalid character
invalid character
invalid character
invalid character
invalid character
invalid character

What’s wrong?


Solution

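  • The pattern fails because Windows ends each line with two characters (\r\n), whereas most other systems use one (\n). With $ the scanner requires \n immediately after the closing brace, but it finds \r instead, so the rule never matches. Each of the 14 characters of \end{document}, plus the \r, then falls through to the catch-all rule, which would account for the 15 "invalid character" messages.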

    This is explained in the flex manual:

    ‘r$’
    an ‘r’, but only at the end of a line (i.e., just before a newline). Equivalent to ‘r/\n’.

    Note that flex’s notion of “newline” is exactly whatever the C compiler used to compile flex interprets ‘\n’ as; in particular, on some DOS systems you must either filter out ‘\r’s in the input yourself, or explicitly use ‘r/\r\n’ for ‘r$’.

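    The quick solution is to change:

    ^\\end\{document\}$

    to

    ^\\end\{document\}\r\n

    This version matches only files with Windows line endings, and unlike $ it consumes the line ending as part of the match. To accept both styles, make the carriage return optional:

    ^\\end\{document\}\r?\n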

    However, if the expression is the last thing in the file and no end-of-line follows it, which can happen on any system, you would have to match that case separately. Flex does permit matching end-of-file with:

    <<EOF>>

    but that rule cannot be combined with another pattern and has side effects of its own, so it is often easier not to anchor the pattern to the end (of line or file).