parsing bison flex-lexer yacc lex

Is my lexer doing too much -- is it doing the work of the parser?


My input consists of a series of names, each on a new line. Each name consists of a firstname, optional middle initial, and lastname. The name fields are separated by tabs. Here is a sample input:

Sally   M.    Smith
Tom     V.    Jones
John          Doe

Below are the rules for my Flex lexer. It works fine, but I am concerned that my lexer is doing too much: it is determining whether a token is a firstname, a middle initial, or a lastname. Should that determination be done in the parser, not the lexer? Am I abusing the Flex state capability? What I am seeking is a critique of my lexer. I am just a beginner; how would a parsing expert create lexer rules for this input?

<INITIAL>{
         [a-zA-Z]+          { yylval.strval = strdup(yytext); return(FIRSTNAME); }
         \t                 { BEGIN MI_STATE; }
         .                  { BEGIN JUNK_STATE; }
}
<MI_STATE>{
         [A-Z]\.            { yylval.strval = strdup(yytext); return(MI); }
         \t                 { BEGIN LASTNAME_STATE; }
         .                  { BEGIN JUNK_STATE; }
}
<LASTNAME_STATE>{
         [a-zA-Z]+          { yylval.strval = strdup(yytext); return(LASTNAME); }
         \n                 { BEGIN INITIAL; return EOL; }
         .                  { BEGIN JUNK_STATE; }
}
<JUNK_STATE>.               { printf("JUNK: %s\n", yytext);  }

Solution

  • You can use lexer states the way you do in this question, but they are better used as a means to conditionally activate rules. For example, think of handling multi-line comments, here-documents, or (for us silverbacks) embedded SQL.

    In your question, there's no lexical difference between a given name and a family name -- they both are matched by [a-zA-Z]+, as would be middle names, if you were to extend your lexer.

    Short answer: yes, lex NAME tokens and let the parser determine whether you have three NAME tokens on a line.
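
To make the short answer concrete, here is a minimal sketch of a stateless lexer, not a definitive implementation. It assumes a bison-generated header named names.tab.h (a hypothetical file name), keeps the asker's MI token for the lexically distinct middle initial, and treats tabs as plain separators that the parser never sees.

```lex
%option noyywrap
%{
#include <string.h>       /* strdup */
#include "names.tab.h"    /* assumed name of the header produced by bison -d */
%}

%%
[a-zA-Z]+   { yylval.strval = strdup(yytext); return NAME; }
[A-Z]\.     { yylval.strval = strdup(yytext); return MI; /* longest match beats NAME */ }
\t+         { /* field separator: nothing to return */ }
\n          { return EOL; }
.           { return (unsigned char) yytext[0]; /* unknown char: parser reports it */ }
%%
```

The grammar then decides which NAME is the first name and which is the last name purely by position. The rule names (input, line) and the printed output format below are illustrative assumptions.

```yacc
%{
#include <stdio.h>
#include <stdlib.h>
int yylex(void);
void yyerror(const char *msg) { fprintf(stderr, "error: %s\n", msg); }
%}

%union { char *strval; }
%token <strval> NAME MI
%token EOL
%destructor { free($$); } <strval>   /* free strings discarded during error recovery */

%%
input : /* empty */
      | input line
      ;

line  : NAME MI NAME EOL  { printf("first=%s mi=%s last=%s\n", $1, $2, $3);
                            free($1); free($2); free($3); }
      | NAME NAME EOL     { printf("first=%s last=%s\n", $1, $2);
                            free($1); free($2); }
      | error EOL         { yyerrok; /* skip a malformed line and resume */ }
      ;
%%

int main(void) { return yyparse(); }
```

Assuming the two files are saved as names.l and names.y, building would look like: bison -d names.y, then flex names.l, then cc names.tab.c lex.yy.c -o names. Note that no start conditions are needed: the lexer only classifies text by shape, and the "junk" case falls out of the parser's ordinary error handling.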