Concatenated tokens in Lex

I am writing a lexer for a C preprocessor, along with some other C syntax. As part of that, I have to identify file names in #include directives, and that is where I am stuck. A file name has two parts: a basename and an extension. The basename can be matched by the lexer's "IDENTIFIER" regex, and the "." separating the two parts by its own rule.

There are separate regexes for "IDENTIFIER" and for ".". For the file name, I am considering writing another regex that would basically be a concatenation of "IDENTIFIER", "." and "h". My question is: if I write a regex for the file name as described, how will it be processed? Given that there are already rules for the separate tokens, will the lexer identify three tokens (IDENTIFIER, DOT and IDENTIFIER), or will it identify a file name?

Solution

To answer the literal question first: (f)lex always takes the longest match, and if two rules match the same length it takes the one listed first. A file name rule that matches all of foo.h would therefore win over IDENTIFIER, which matches only foo. The catch is that the rule would win everywhere, not just after #include, so a member access such as s.h in ordinary code would also come out as a file name.

There is, as far as I can see, no good reason for a preprocessor lexer to view a filename in an include directive as anything other than an opaque sequence of characters. The precise name is not relevant to the preprocessor; it may contain no extension or more than one . (provided the operating system permits that, which most do these days); it might include special characters such as slashes; it might be a number; etc.
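
To make the "opaque sequence of characters" point concrete, here is a minimal sketch in plain C (outside of flex) that pulls the name out of an include argument without looking inside it. The names parse_include_arg and include_kind are hypothetical, made up for this illustration. Rejecting an empty name is an assumption of the sketch too.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical result type for this sketch; not part of flex. */
enum include_kind { INC_NONE, INC_QUOTED, INC_SYSTEM };

/*
 * Parse the argument of an include directive, e.g. ` "a/b.tar.h"` or
 * ` <sys/types.h>`. The name between the delimiters is copied to `out`
 * unchanged: no assumptions about extensions, dots or slashes.
 * Returns INC_NONE if the argument is malformed, empty, or too long.
 */
enum include_kind parse_include_arg(const char *arg, char *out, size_t outsz)
{
    /* Skip blanks between the directive name and the delimiter. */
    while (*arg == ' ' || *arg == '\t')
        arg++;

    char close;
    enum include_kind kind;
    if (*arg == '"') {
        close = '"';
        kind = INC_QUOTED;
    } else if (*arg == '<') {
        close = '>';
        kind = INC_SYSTEM;
    } else {
        return INC_NONE;
    }
    arg++;

    /* The name ends at the closing delimiter, which must be on the same line. */
    size_t len = strcspn(arg, close == '"' ? "\"\n" : ">\n");
    if (arg[len] != close || len == 0 || len >= outsz)
        return INC_NONE;

    memcpy(out, arg, len);
    out[len] = '\0';
    return kind;
}
```

Names such as sys/my file.tar.h or vector pass through untouched, which an IDENTIFIER "." "h" pattern could not manage.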

Also, the handling of angle brackets and quotes is idiosyncratic within the argument to an include directive. Consequently, the usual way to deal with include directives is to use a context-sensitive pattern, for example using (f)lex start conditions.

Since newlines are also handled specially in all preprocessor directives, you'll normally need a context-sensitive pattern for them as well.

Here is a rough sketch using flex syntax; lots of details are left out.

%x PP_DIRECT PP_ARG PP_INCLUDE
%%

^[[:blank:]]*"#"   { BEGIN(PP_DIRECT); }
<PP_DIRECT>include { BEGIN(PP_INCLUDE); return T_INCLUDE; }
  /* You might want to recognize other preprocessor directives as
   * specific keyword tokens. In particular, the scanner needs
   * to be aware of conditionals, since it might have to put itself
   * into a mode where it skips to the matching #endif
   */
<PP_DIRECT>[[:alpha:]]+ { BEGIN(PP_ARG);  /* ... */ }
  /* Normally newlines are not returned to the parser, but here we do. */
<PP_ARG>\n         { BEGIN(INITIAL); return '\n'; }
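  /* This should actually be done in a previous step */
<PP_ARG>\\\n       { /* ignore the line continuation */ }
  /* Skip blanks inside a directive. The start conditions are exclusive,
   * so without a rule like this blanks would fall through to the
   * default rule and be echoed.
   */
<PP_DIRECT,PP_INCLUDE,PP_ARG>[[:blank:]]+   { /* skip */ }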
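  /* The name may not span lines, so newline is excluded as well */
<PP_INCLUDE>["][^"\n]*["]  { yytext[yyleng-1] = 0;  /* drop closing quote */
                           do_include(yytext+1);  /* skip opening quote */
                           /* Really, should check that only whitespace follows */
                           BEGIN(PP_ARG);
                         }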
<PP_INCLUDE>[<][^>\n]*[>]  { yytext[yyleng-1] = 0;  /* drop closing '>' */
                           do_system_include(yytext+1);  /* skip opening '<' */
                           BEGIN(PP_ARG);
                         }
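
As the comment in the sketch says, removing backslash-newline pairs really belongs in a step that runs before the scanner sees the input. A minimal sketch of such a step in C follows. It assumes the whole input sits in a writable buffer with room for a terminating NUL, and that line endings are plain \n. The name splice_lines is hypothetical. A real preprocessor would also have to remember how many lines it joined, so that reported line numbers stay correct.

```c
#include <stddef.h>

/*
 * Remove every backslash-newline pair from buf in place, joining the
 * physical lines on either side. buf must have room for len + 1 bytes.
 * Returns the new length; the result is NUL-terminated.
 */
size_t splice_lines(char *buf, size_t len)
{
    size_t r = 0;  /* read position */
    size_t w = 0;  /* write position, never ahead of r */

    while (r < len) {
        if (buf[r] == '\\' && r + 1 < len && buf[r + 1] == '\n') {
            r += 2;               /* drop the pair */
        } else {
            buf[w++] = buf[r++];  /* keep everything else, including lone backslashes */
        }
    }
    buf[w] = '\0';
    return w;
}
```

With this done up front, the <PP_ARG>\\\n rule is no longer needed, and a directive split across lines reaches the scanner as one logical line.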