Flex lexer: later rule gets priority over earlier rule


I'm trying to extract information from a C/C++ source file, specifically the content of a macro invocation.

E.g.:

  • From MYMACRO(random content) random content should be extracted.
  • From MYMACRO (random content) random content should be extracted.

Problem: Bison does not recognize MYMACRO as a token.

This code is only the first step and expects just the macro itself as input.

Flex file: parser.l

%{
 #include <iostream>
 #include "parser.tab.h"
 using namespace std;
 extern int yylex(); 
%}

%option noyywrap

%%

"MYMACRO" {
  return EXTRACT_CONTENT_START;
}

[(] {
  return BRACE_OPEN;
}

[)] {
  return BRACE_CLOSE; 
}

.* { 
    yylval.sval = strdup(yytext);
    return ANY_TEXT;
}


%%

Bison file: parser.y

%{

  #include <iostream>
  #include <string.h>
  using namespace std;

  extern int yylex();
  extern int yyparse();
  extern int yy_scan_string(char const *);

  void yyerror(const char *s);

%}

%union {
  int ival;
  char * sval;
  char cval;
}

%error-verbose


%token EXTRACT_CONTENT_START
%token <cval> BRACE_OPEN
%token <cval> BRACE_CLOSE
%token <sval> ANY_TEXT

%%

program:
    EXTRACT_CONTENT_START 
    BRACE_OPEN
    ANY_TEXT
    BRACE_CLOSE 
    ;

%%

int main(int ,char**){
  yy_scan_string("MYMACRO(random content)");
  yyparse();
}

void yyerror(const char *s) {
  cout << endl << s << endl;
  exit(-1);
}

  • expected: random content
  • actual: the error "unexpected ANY_TEXT, expecting EXTRACT_CONTENT_START". So instead of applying the first matching rule, flex is actually applying the last one.

I've also tried using states, changing the last rule in the flex file to:

<STATE_CONTENT> .* { 
    yylval.sval = strdup(yytext);
    return ANY_TEXT;
} 

But this results in an "unrecognized rule" error on the line containing %%.


Solution

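  • Why the last rule is preferred:

    Flex always takes the longest match; rule order only breaks ties between matches of equal length. The pattern .* matches every character up to the end of the line, so for the input MYMACRO(random content) it matches all 23 characters, while "MYMACRO" matches only 7. ANY_TEXT therefore always wins.

    To fix this, change the files as follows.

    parser.l:

    Remove the .* rule and add this one instead: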

    . { 
        yylval.cval = *yytext;
        return ANY_CHAR;
    }
    

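    This rule matches exactly one character, so it can never out-match "MYMACRO". As long as it stays the last rule, it also loses the one-character ties against the ( and ) rules. It therefore has the lowest priority of all the rules.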

    parser.y:

    Add a new token:

    %token <cval> ANY_CHAR
    

    To act on the whole string, add this rule and use anyText in place of ANY_TEXT in the program rule:

      anyText:
          anyText ANY_CHAR { cout << $2; }
          | /* empty */
      ;
    

    Regarding the state problem, the answer from rici:

    You cannot put whitespace before the pattern, whether or not it is preceded by a state. Another way of saying that, which is technically more accurate, is that patterns cannot contain unquoted whitespace, and the prefix is part of the pattern.