
Flex. Detect characters after preprocessor directives


I am trying to develop a lexical analyser to detect preprocessor directives and "code to analyze".

I want the analyser to detect preprocessor directives together with the identifiers, integer constants, etc. that appear on the same line as a directive, and to report every other line as "code to analyze".

For example, for the following code in a text file:

#define B 0
#ifdef C
#if D > ( 0 + 1 )
main(){
printf(“Hello”);
}

I want to detect the following elements

  1. Directives: #define, #ifdef, #if
  2. Identifiers: B, C, D
  3. Integer constants: 0, 1
  4. Symbol: ( , )
  5. Relational operators: >
  6. Arithmetic operators: +
  7. Code to analyze: main(){, printf(“Hello”); , }

This is my code that implements the analyser:

%{
    /*Libraries Declaration */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /*Functions Headers */

    /*Global variables */

%}

/** Regular Expressions Definitions */

TAB [ \t]+
DIG [0-9]
RESERV_WORD #define|#elif|#else|#endif|#if|#ifdef|#ifndef|#undef

DIR [^#]
OP_RELA {DIR}">"|">="|"<"|"<="|"=="|"!="
OP_ARIT {DIR}"+"|"-"|"*"|"/"|"%"
SYMBOL  {DIR}"("|")"
INT_CTE {DIR}{DIG}+
SYMBOLYC_CTE {DIR}("\"")(.*)("\"")
IDENTIFIER {DIR}[A-Z]{1,8}
CODE_TO_ANALY ^[^#].*
/* Translation rules */
%option noyywrap
%%
{TAB}    { }
{CODE_TO_ANALY} {
  printf("[%s] is code to analyze\n",yytext);

}

{OP_RELA}       {           //Detect relational operators
            printf("[%s] is relational operator\n",yytext);
        }

{OP_ARIT}   {
            printf("[%s] is arith operator \n",yytext);
        }

{RESERV_WORD}       {       //Detect reserved words
            printf("[%s] is a reserved word\n",yytext);
        }

{INT_CTE}       {               //Detect integer constants
            printf("[%s] is an integer constant\n",yytext);
        }

{SYMBOL}    { //Detect special symbols
    printf("[%s] is a special symbol \n",yytext);
}

{SYMBOLYC_CTE}  { //Detect symbolic constants
            printf("[%s] is a symbolic constant\n",yytext);
        }

{IDENTIFIER}    { //Detect identifiers
            printf("[%s] is an identifier\n",yytext);
}



. {}


%%

int main(int argc, char *argv[])
{
    if(argc>1){
        //User entered a valid file name

        yyin=fopen(argv[1],"r");
        yylex();

        printf("******************************************************************\n");
    }
    else{
        //User didnt enter a valid file name

        printf("\n");
        exit(0);
    }

    return 0;
}

The analyser works well when the tokens in the input file are separated by spaces.

Input txt file

#define B 0
#ifdef B
#if B > ( 0 + 1 > 5 )
main(){
printf(“Hola programa”)
        }

Output in console

    [#define] is a reserved word
    [ B] is an identifier
    [ 0] is an integer constant
    [#ifdef] is a reserved word
    [ B] is an identifier
    [#if] is a reserved word
    [ B] is an identifier
    [ >] is relational operator
    [ (] is a special symbol 
    [ 0] is an integer constant
    [ +] is arith operator 
    [ 1] is an integer constant
    [ >] is relational operator
    [ 5] is an integer constant
    [)] is a special symbol 
    [main(){] is code to analyze
    [printf(“Hola programa”)] is code to analyze
    [}] is code to analyze

However, it does not work correctly with an input file that has no spaces between tokens.

Input txt file:

#define B 0
#ifdef B
#if B>(0+1)
main(){
printf(“Hola programa”)
}

Output in console:

[#define] is a reserved word
[ B] is an identifier
[ 0] is an integer constant
[#ifdef] is a reserved word
[ B] is an identifier
[#if] is a reserved word
[ B] is an identifier
[>(] is a special symbol 
[0+] is arith operator 
[)] is a special symbol 
[main(){] is code to analyze
[printf(“Hola programa”)] is code to analyze
[}] is code to analyze

Solution

  • Here's an interesting fact. When you're tracing the tokens produced, what you see is (heavily redacted):

    [ (] is a special symbol 
    [)] is a special symbol 
    

    Why does ( show up with a space before it, and not )? And could this somehow be related to the inappropriate token:

    [>(] is a special symbol
    

    With that hint, let's take a look at the definition of SYMBOL. There is a rule:

    {SYMBOL}    { printf("[%s] is a special symbol \n",yytext); }
    

    which depends on the macro definition

    SYMBOL  {DIR}"("|")"
    

    which in turn refers to the macro DIR:

    DIR [^#]
    

    In other words, the result after macro-processing would be, approximately:

    [^#]"("|")" { printf("[%s] is a special symbol \n",yytext); }
    

    That rule will apply to either of two possibilities:

    1. Any character other than a # followed by (

    2. A )

    That pattern is certainly matched by the two characters " (" (a space followed by an open parenthesis), as well as by the single character ). Probably you also have a rule to discard whitespace, but it won't apply in the case of " (" because of the longest-match rule: the two-character SYMBOL match beats the one-character whitespace match. So that does, in fact, explain why the open parenthesis shows up with whitespace before it.
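    To make the two alternatives concrete, here is a minimal C sketch of what the expanded pattern accepts. The helper match_symbol is hypothetical (it is not part of flex); it only mimics the pattern [^#]"("|")" by hand and returns how many characters it would match at the start of a string.

```c
#include <stddef.h>

/*
 * Length matched at s by the expanded SYMBOL pattern  [^#]"("|")"
 * (0 means no match). match_symbol is a hypothetical helper written
 * for illustration; it is not generated by flex.
 */
size_t match_symbol(const char *s)
{
    /* First alternative: any character other than '#', followed by '(' */
    if (s[0] != '\0' && s[0] != '#' && s[1] == '(')
        return 2;
    /* Second alternative: a lone ')' */
    if (s[0] == ')')
        return 1;
    return 0;
}
```

    With this sketch, match_symbol(" (") and match_symbol(">(") both return 2, match_symbol(")") returns 1, and match_symbol("(0") returns 0, because a ( with nothing in front of it is not covered by either alternative.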

    It also explains what happens with the lexical analysis of #if B>(0+1). First, #if is recognised. Then the rule [^#][A-Z]{1,8} matches " B", because [^#] matches the space. The next character is >, which does not match [^#]">"|">="|"<"|"<="|"=="|"!=": that pattern only accepts a bare > as the second character of a match, after some character other than #, and here the > comes first and is followed by (. On the other hand, > is not a #, so the two characters >( do match [^#]"("|")". The same thing then happens with 0+, which matches [^#]"+". (Compare with what would occur had the input been #if B>=(0+1).)
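    The walk-through above can be replayed with a small hand-written simulation of the longest-match rule. This is only a sketch under stated assumptions: the matcher functions (m_tab, m_rela, and so on), the rules table and scan are hypothetical names, they model just a subset of the question's rules after macro expansion, and ties are broken in favour of the earlier rule, as flex does.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical hand-written matchers for the macro-expanded rules.
 * Each returns the number of characters matched at s, or 0. */
static int not_hash(char c) { return c != '\0' && c != '#'; }
static int is_upper(char c) { return c >= 'A' && c <= 'Z'; }
static int is_digit(char c) { return c >= '0' && c <= '9'; }

static size_t m_tab(const char *s)     /* [ \t]+ */
{
    size_t n = 0;
    while (s[n] == ' ' || s[n] == '\t') n++;
    return n;
}
static size_t m_rela(const char *s)    /* [^#]">"|">="|"<"|"<="|"=="|"!=" */
{
    if (not_hash(s[0]) && s[1] == '>') return 2;
    if (s[0] != '\0' && strchr("><=!", s[0]) && s[1] == '=') return 2;
    return s[0] == '<' ? 1 : 0;
}
static size_t m_arit(const char *s)    /* [^#]"+"|"-"|"*"|"/"|"%" */
{
    if (not_hash(s[0]) && s[1] == '+') return 2;
    return (s[0] != '\0' && strchr("-*/%", s[0])) ? 1 : 0;
}
static size_t m_int(const char *s)     /* [^#][0-9]+ */
{
    size_t n = 1;
    if (!not_hash(s[0]) || !is_digit(s[1])) return 0;
    while (is_digit(s[n])) n++;
    return n;
}
static size_t m_symbol(const char *s)  /* [^#]"("|")" */
{
    if (not_hash(s[0]) && s[1] == '(') return 2;
    return s[0] == ')' ? 1 : 0;
}
static size_t m_ident(const char *s)   /* [^#][A-Z]{1,8} */
{
    size_t n = 1;
    if (!not_hash(s[0])) return 0;
    while (n <= 8 && is_upper(s[n])) n++;
    return n > 1 ? n : 0;
}
static size_t m_any(const char *s)     /* . */
{
    return (s[0] != '\0' && s[0] != '\n') ? 1 : 0;
}

struct rule { const char *label; size_t (*match)(const char *); };

/* Same relative order as in the question; a NULL label means "discard". */
static const struct rule rules[] = {
    { NULL, m_tab },          { "relational", m_rela }, { "arith", m_arit },
    { "integer", m_int },     { "symbol", m_symbol },   { "identifier", m_ident },
    { NULL, m_any },
};

/* Append one "[text] label" line per reported token to out. */
void scan(const char *s, char *out, size_t outsz)
{
    out[0] = '\0';
    while (*s != '\0') {
        size_t best = 0, pick = 0, k;
        for (k = 0; k < sizeof rules / sizeof rules[0]; k++) {
            size_t n = rules[k].match(s);
            if (n > best) { best = n; pick = k; }  /* '>' keeps the earliest rule on ties */
        }
        if (best == 0) {
            best = 1;  /* nothing matched (e.g. a newline): skip one character */
        } else if (rules[pick].label != NULL) {
            size_t used = strlen(out);
            snprintf(out + used, outsz - used, "[%.*s] %s\n",
                     (int)best, s, rules[pick].label);
        }
        s += best;
    }
}
```

    Calling scan(" B>(0+1)", out, sizeof out) on a sufficiently large buffer reports " B" as an identifier, ">(" as a symbol, "0+" as an arithmetic operator and ")" as a symbol, which is the same sequence as in the question's second output. The 1 is silently dropped because only the catch-all rule matches it.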

    So that explains what is going on. But do these rules make any sense?

    I suspect that you were thinking that the {DIR} expansion would cause the rest of the rule to only apply on a line which doesn't start with a #. Nothing in the (f)lex regular expression syntax would suggest that interpretation, and I don't know of any regular expression syntax in which that would work.

    (F)lex does have a mechanism for using different rules in different lexical contexts, called start conditions, which is probably what you want in this case. But that mechanism can only be invoked in rules, not in macro definitions.

    It's worth reading the section of the flex manual on start conditions for a complete description; here's a partial solution based on it:

     /* The various contexts for parsing preprocessor directives. A full
      * solution would have more of these.
      */
    %x CPP CPP_IF CPP_IFDEF CPP_REST
    %%
      /* Anything which is not a preprocessor command */
    [[:blank:]]*[^#\n[:blank:]].*      { printf("%s is code to analyse.\n", yytext); }
      /* cpp directives */
    [[:blank:]]*#[[:blank:]]*          { BEGIN(CPP); }
      /* Anything else is a completely blank line. Ignore it and the trailing newline. */
    [[:blank:]]*\n           { /* Ignore; a bare .*\n would be the longest match on every line */ }
      /* The first thing in a preprocessor line is normally the command.
       * In a full solution, there would be different contexts for each
       * command type; this is just a partial solution.
       */
    <CPP>{
        (el)?if              { printf("#%s directive\n", yytext); BEGIN(CPP_IF); }
        ifn?def              { printf("#%s directive\n", yytext); BEGIN(CPP_IFDEF); }
        else|endif           { printf("#%s directive\n", yytext); BEGIN(CPP_REST); }
        /* Other directives need to be added. */
        /* Fallbacks */
        [[:alpha:]][[:alnum:]]* { printf("#%s directive\n", yytext); BEGIN(CPP_REST); }
        .                    { puts("Unknown # directive"); BEGIN(CPP_REST); }
        \n                   { BEGIN(INITIAL); }
    }
      /* Context to just skip everything to the end of the pp directive */
    <CPP_REST>(.|\\\n)*      { BEGIN(INITIAL); }
      /* Constant expression context, for #if and #elif */
    <CPP_IF>{
        [[:digit:]]+         { printf("[%s] is an integer constant\n", yytext); }
        [[:alpha:]_][[:alnum:]_]* { printf("[%s] is an identifier\n", yytext); }
        [[:blank:]]+         ;
        [-+*/%!~|&]|"||"|"&&" { printf("[%s] is an arithmetic operator\n", yytext); }
        [=<>!]=?             { printf("[%s] is a relational operator\n", yytext); }
        [()]                 { printf("[%s] is a parenthesis\n", yytext); }
        .                    { printf("[%s] is unrecognized\n", yytext); }
        \n                   { BEGIN(INITIAL); }
    }
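
    If it helps to picture what BEGIN does, the idea can be imitated in plain C with a mode variable that selects which set of rules is active. This is a rough sketch, not what flex generates: the names mode, counts and scan_with_modes are hypothetical, and as a simplifying assumption the operands of every directive are tokenized (the flex rules above only do that for #if and #elif).

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical modes that mirror the idea of flex start conditions. */
enum mode { MODE_INITIAL, MODE_CPP_IF };

struct counts { int code_lines; int directives; int identifiers; int integers; };

/* Scan text, switching rule sets the way BEGIN(...) switches start conditions. */
struct counts scan_with_modes(const char *s)
{
    struct counts c = { 0, 0, 0, 0 };
    enum mode mode = MODE_INITIAL;
    size_t i = 0;

    while (s[i] != '\0') {
        if (mode == MODE_INITIAL) {
            /* Rules that are only active at the start of a line */
            while (s[i] == ' ' || s[i] == '\t') i++;
            if (s[i] == '#') {
                c.directives++;
                i++;
                while (s[i] == ' ' || s[i] == '\t') i++;
                while (isalpha((unsigned char)s[i])) i++;   /* directive name */
                mode = MODE_CPP_IF;                         /* like BEGIN(CPP_IF) */
            } else if (s[i] == '\n') {
                i++;                                        /* blank line */
            } else if (s[i] != '\0') {
                c.code_lines++;                             /* "code to analyze" */
                while (s[i] != '\0' && s[i] != '\n') i++;
            }
        } else {
            /* Rules that are only active inside a directive line */
            if (s[i] == '\n') {
                mode = MODE_INITIAL;                        /* like BEGIN(INITIAL) */
                i++;
            } else if (isdigit((unsigned char)s[i])) {
                c.integers++;
                while (isdigit((unsigned char)s[i])) i++;
            } else if (isalpha((unsigned char)s[i]) || s[i] == '_') {
                c.identifiers++;
                while (isalnum((unsigned char)s[i]) || s[i] == '_') i++;
            } else {
                i++;                                        /* operators, blanks, etc. */
            }
        }
    }
    return c;
}
```

    Run on the six-line example from the question, this sketch counts 3 directives, 3 identifiers (B, C, D), 3 integer constants (0, 0, 1) and 3 lines of code to analyze. The identifier and integer rules never see main or printf, because those rules are only consulted while the mode is MODE_CPP_IF.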