
Concatenated tokens in Lex


I am writing a lexer for a C preprocessor, along with some other C syntax. As part of that, I have to identify file names in #include directives, and that is where I am stuck. A file name has two parts: a basename and an extension. The basename can be matched by the lexer's "IDENTIFIER" regex, and the "." separating the two parts already has a rule of its own.

There are separate regexes for "IDENTIFIER" and for ".". For the file name, I am considering writing another regex that would basically be a concatenation of "IDENTIFIER", "." and "h". My question is: if I write a regex for the file name as described, how will it be processed? Given that there are already rules for the separate tokens, will the lexer produce three tokens (IDENTIFIER, DOT and IDENTIFIER), or will it recognize a single file name?


Solution

  • There is, as far as I can see, no good reason for a preprocessor lexer to view a filename in an include directive as anything other than an opaque sequence of characters. The precise name is not relevant to the preprocessor: it may contain no extension, or more than one "." (provided the operating system permits that, which most do these days); it might include special characters such as slashes; it might be a number; and so on.

    Also, the handling of angle brackets and quotes is idiosyncratic within the argument to an include directive. Consequently, the usual way to deal with include directives is to use a context-sensitive pattern, for example using (f)lex start conditions.

    Since newlines are also handled specially in all preprocessor directives, you'll normally need a context-sensitive pattern for them as well.

    A rough sketch using flex syntax. Lots of details are left out.

    %x PP_DIRECT PP_ARG PP_INCLUDE
    %%
    
    ^[[:blank:]]*"#"   { BEGIN(PP_DIRECT); }
    <PP_DIRECT>include { BEGIN(PP_INCLUDE); return T_INCLUDE; }
      /* You might want to recognize other preprocessor directives as
       * specific keyword tokens. In particular, the scanner needs
       * to be aware of conditionals, since it might have to put itself
       * into a mode where it skips to the matching #endif
       */
    <PP_DIRECT>[[:alpha:]]+ { BEGIN(PP_ARG);  /* ... */ }
      /* Normally newlines are not returned to the parser, but here we do. */
    <PP_ARG>\n         { BEGIN(INITIAL); return '\n'; }
      /* Line splicing (backslash-newline) should really be done in an earlier pass */
    <PP_ARG>\\\n       /* IGNORE */
    <PP_INCLUDE>["][^"]*["]  { yytext[yyleng-1] = 0;
                               do_include(yytext+1);
                               /* Really, should check that only whitespace follows */
                               BEGIN(PP_ARG);
                             }  
    <PP_INCLUDE>[<][^>]*[>]  { yytext[yyleng-1] = 0;
                               do_system_include(yytext+1);
                               BEGIN(PP_ARG);
                             }
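
The two PP_INCLUDE actions do the same small piece of string handling: overwrite the closing delimiter with a NUL and skip the opening one, leaving the name as an uninterpreted string. Here is a minimal standalone C sketch of that step, which can be tried outside the scanner. The names `split_include_arg` and `enum include_kind` are hypothetical helpers invented for this illustration; they are not part of flex or of the sketch above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of an #include argument (illustration only). */
enum include_kind { INCLUDE_INVALID, INCLUDE_USER, INCLUDE_SYSTEM };

/*
 * `arg` is the matched text including its delimiters, e.g. "\"foo.h\"" or
 * "<sys/types.h>". It must be writable, just as yytext is in a flex action.
 * On success the closing delimiter is replaced by NUL, *name points just
 * past the opening delimiter, and the kind of include is returned.
 * The name itself is never inspected: slashes, several dots, or no
 * extension at all are all passed through unchanged.
 */
enum include_kind split_include_arg(char *arg, char **name)
{
    size_t len = strlen(arg);
    char close;

    *name = NULL;
    if (len < 2)
        return INCLUDE_INVALID;

    if (arg[0] == '"')
        close = '"';
    else if (arg[0] == '<')
        close = '>';
    else
        return INCLUDE_INVALID;

    if (arg[len - 1] != close)
        return INCLUDE_INVALID;

    arg[len - 1] = '\0';   /* same effect as yytext[yyleng-1] = 0 */
    *name = arg + 1;       /* same effect as passing yytext+1 */
    return close == '"' ? INCLUDE_USER : INCLUDE_SYSTEM;
}
```

For instance, a writable buffer holding `<sys/types.h>` would be classified as INCLUDE_SYSTEM with the name `sys/types.h`, while a bare `foo.h` with no delimiters would be rejected. In the scanner itself the patterns already guarantee the delimiters, so the checks here only matter when the function is called on its own.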