I am writing a lexer for the C preprocessor, along with some other C syntax. As part of that I have to identify file names in #include directives, and that is where I am stuck. A file name has two parts: a base name and an extension. The base name can be matched by the lexer's "IDENTIFIER" regex, and the "." separating the two parts can be matched as well.
There are separate regexes for "IDENTIFIER" and for ".". For file names I am considering writing another regex that would basically be the concatenation of "IDENTIFIER", "." and "h". My question is: if I write a regex for file names as described, how will it be processed? Given that there are already rules for the separate tokens, will the lexer produce three tokens (IDENTIFIER, DOT and IDENTIFIER), or will it identify a file name?
There is, as far as I can see, no good reason for a preprocessor lexer to view a filename in an include directive as anything other than an opaque sequence of characters. The precise name is not relevant to the preprocessor; it may contain no extension or more than one "." (provided the operating system permits that, which most do these days); it might include special characters such as slashes; it might be a number; etc.
Also, the handling of angle brackets and quotes is idiosyncratic within the argument to an include directive. Consequently, the usual way to deal with include directives is to use a context-sensitive pattern, for example using (f)lex start conditions.
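To illustrate what "opaque" means here, the following is a minimal hand-written sketch in C, not part of any real preprocessor. The names `parse_include_arg`, `enum include_kind` and its members are invented for this example. It assumes the text after the `include` keyword is available as a NUL-terminated string, copies whatever lies between the delimiters verbatim, and only records which delimiter form was used.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type: which delimiter form the directive used. */
enum include_kind { INC_NONE, INC_QUOTED, INC_ANGLED };

/*
 * Extracts the header name from the text that follows "#include".
 * The name is copied verbatim into out; no attempt is made to split it
 * into a base name and an extension. Returns INC_NONE if the argument
 * is not a well-formed "..." or <...> name, is empty, spans a newline,
 * or does not fit in out (outsz bytes, including the terminating NUL).
 */
enum include_kind parse_include_arg(const char *arg, char *out, size_t outsz)
{
    /* Skip blanks between the directive name and its argument. */
    while (*arg == ' ' || *arg == '\t')
        arg++;

    char close;
    enum include_kind kind;
    if (*arg == '"') {
        close = '"';
        kind = INC_QUOTED;
    } else if (*arg == '<') {
        close = '>';
        kind = INC_ANGLED;
    } else {
        return INC_NONE;
    }

    const char *start = arg + 1;
    const char *end = strchr(start, close);
    if (end == NULL)
        return INC_NONE;

    size_t len = (size_t)(end - start);
    /* A header name must be non-empty and may not span lines. */
    if (len == 0 || memchr(start, '\n', len) != NULL)
        return INC_NONE;
    if (len >= outsz)
        return INC_NONE;

    memcpy(out, start, len);
    out[len] = '\0';
    return kind;
}
```

With this sketch, an argument such as `<sys/types.h>` yields `INC_ANGLED` with the name `sys/types.h`, and `"a.b.c"` yields `INC_QUOTED` with the name `a.b.c`. Slashes and extra dots need no special handling because the name is never tokenized.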
Since newlines are also handled specially in all preprocessor directives, you'll normally need a context-sensitive pattern for them as well.
Here is a rough sketch using flex syntax; lots of details are left out.
%x PP_DIRECT PP_ARG PP_INCLUDE
%%
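^[[:blank:]]*"#" { BEGIN(PP_DIRECT); }
  /* Blanks may separate the '#', the directive name and its argument.
   * Without this rule they would fall through to the default ECHO rule.
   */
<PP_DIRECT,PP_INCLUDE>[[:blank:]]+ /* IGNORE */
<PP_DIRECT>include { BEGIN(PP_INCLUDE); return T_INCLUDE; }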
  /* You might want to recognize other preprocessor directives as
   * specific keyword tokens. In particular, the scanner needs
   * to be aware of conditionals, since it might have to put itself
   * into a mode where it skips to the matching #endif
   */
<PP_DIRECT>[[:alpha:]]+ { BEGIN(PP_ARG); /* ... */ }
  /* Normally newlines are not returned to the parser, but here we do. */
<PP_ARG>\n { BEGIN(INITIAL); return '\n'; }
  /* This should actually be done in a previous step */
<PP_ARG>\\\n /* IGNORE */
  /* \n is excluded so that an unterminated name cannot run past the
   * end of the line.
   */
<PP_INCLUDE>["][^"\n]*["] { yytext[yyleng-1] = 0;
                            do_include(yytext+1);
                            /* Really, should check that only whitespace follows */
                            BEGIN(PP_ARG);
                          }
<PP_INCLUDE>[<][^>\n]*[>] { yytext[yyleng-1] = 0;
                            do_system_include(yytext+1);
                            BEGIN(PP_ARG);
                          }
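The comment in the sketch above notes that backslash-newline handling really belongs in a previous step. As one possible illustration of such a step, here is a small C function that splices continued lines in place before the buffer is handed to the scanner. The name `splice_lines` is made up for this example, and the sketch assumes the whole input is held in a writable, NUL-terminated buffer.

```c
#include <stddef.h>

/*
 * Removes every backslash-newline pair from the NUL-terminated buffer,
 * in place, and returns the new length. A backslash that is not
 * immediately followed by a newline is left untouched.
 */
size_t splice_lines(char *buf)
{
    char *dst = buf;
    const char *src = buf;

    while (*src != '\0') {
        if (src[0] == '\\' && src[1] == '\n') {
            src += 2;           /* drop the continuation marker */
        } else {
            *dst++ = *src++;    /* keep everything else as is */
        }
    }
    *dst = '\0';
    return (size_t)(dst - buf);
}
```

One consequence of doing this as a separate pass is that line numbers in the spliced buffer no longer match the original file, so a real implementation would need some way to track the removed newlines for diagnostics.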