Tags: c++, regex, token, flex-lexer

Ambiguity with regular expressions to recognize tokens


I'm programming in C++, using the Flex lexer, and I need to recognize two different tokens that share some symbols. First I need to recognize expressions of the form (a<b+c*4), and second I need to recognize the logical operator <->. If I write a <-> b (a iff b), the lexer tokenizes it as a<, -> and b, but I want to get a, <-> and b.

//REGEX FOR MATH EXPRESSIONS.
[a-zA-Z0-9<>=]+
//REGEX FOR THE <-> LOGIC OPERATOR
"<->"

This is my flex code:

%option noyywrap
%{
    #include <iostream>
    #include "parser.tab.c"
    using namespace std;
%}

%%

[a-zA-Z0-9<>=]+  {
    yylval = strdup(yytext);
    return SYMBOL;
}

"&&" {
    return AND;
}

"||" {
    return OR;
}

"!" {
    return NOT;
}

"!(" {
    return DIST;
}

[ \0\0] {
    return END;
}

"("     {
    return LEFT_PAR;
}

")"     {
    return RIGHT_PAR;
}

"->"    {
    return THEN;
}

"<->"   {
    return IFF;
}

%%

How can I solve this problem?

Greetings.


Solution

  • It seems a bit odd to me to want to parse only part of an expression grammar. It's really not appropriate to try to recognize expressions in a scanner, although it certainly can be done. Personally, I'd just leave it to the parser to handle the arithmetic part of the expressions (perhaps by recombining the tokens into a string), particularly since it will need to intervene anyway to handle parentheses.

    Nonetheless, it is possible to coerce flex into doing this work, either by using yymore() to accumulate tokens, or by accumulating the tokens yourself in a string accumulator. Either way, you'll end up having to "send" the accumulated expression when you find some other token (one of your logical operators, for example); that is a lot easier if you use a push parser (definitely my preference), but it can be done using start conditions.
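    The string-accumulator variant just mentioned can be sketched in ordinary C++, outside flex entirely. This is only an illustration of the idea (the names here are invented, not part of flex or the original scanner): multi-character logical operators are tried first at each position, whitespace and single-character punctuation flush the buffer, and everything else is appended to a pending string which is "sent" as one SYMBOL-like token.

    ```cpp
    #include <cctype>
    #include <cstring>
    #include <string>
    #include <vector>

    // Hypothetical sketch of the "accumulate the tokens yourself in a string
    // accumulator" approach. Arithmetic pieces pile up in `pending`; a logical
    // operator, whitespace, or punctuation flushes the accumulated expression
    // as a single token before the operator itself is emitted.
    std::vector<std::string> tokenize(const std::string& in) {
        static const char* logical_ops[] = {"<->", "->", "&&", "||"};  // longest first
        std::vector<std::string> out;
        std::string pending;  // the accumulator
        auto flush = [&] {
            if (!pending.empty()) { out.push_back(pending); pending.clear(); }
        };
        for (std::size_t i = 0; i < in.size();) {
            bool is_op = false;
            for (const char* op : logical_ops) {   // logical operators win over '<'
                std::size_t n = std::strlen(op);
                if (in.compare(i, n, op) == 0) {
                    flush();                       // "send" the accumulated expression
                    out.push_back(op);
                    i += n;
                    is_op = true;
                    break;
                }
            }
            if (is_op) continue;
            if (std::isspace(static_cast<unsigned char>(in[i]))) { flush(); ++i; continue; }
            if (in[i] == '(' || in[i] == ')' || in[i] == '!') {  // single-char tokens
                flush();
                out.push_back(std::string(1, in[i++]));
                continue;
            }
            pending += in[i++];                    // arithmetic piece: accumulate
        }
        flush();
        return out;
    }
    ```

    With this, `a <-> b` comes out as `a`, `<->`, `b`, and `(a<b+c*4)` as `(`, `a<b+c*4`, `)`. A real push-parser version would call the parser at each flush instead of appending to a vector.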

    In the interests of not requiring a string accumulator object or a particular version of bison (and not knowing how you are handling the tokens), here's a start-condition-based solution which uses flex itself to accumulate the tokens:

    %x IN_ARITH
    arith_piece     [[:alnum:]]+|[-+<>=]
    %%
                                int symbol_leng = 0;
    
    "&&"                      { return AND;                    }
    "||"                      { return OR;                     }
    "->"                      { return THEN;                   }
    "<->"                     { return IFF;                    }
    "!("                      { return DIST;      /* Note 1 */ }
    " "                       { return END;       /* Note 2 */ }
    .|\n                      { return yytext[0]; /* Note 3 */ }
    
    <*>{arith_piece}          { BEGIN(INITIAL);   /* Note 5 */
                                yylval = strdup(yytext);
                                return SYMBOL;                 }
    
    <*>{arith_piece}/(.|\n)   { BEGIN(IN_ARITH);  /* Note 4 */
                                symbol_leng = yyleng; 
                                yymore();                      }
    
    <IN_ARITH>"->"|"<->"|.|\n { BEGIN(INITIAL);   /* Note 6 */
                                yyless(symbol_leng);
                                yylval = strdup(yytext);
                                yylval[symbol_leng] = 0;
                                return SYMBOL;                 }
    

    Notes

    1. I can't help thinking that returning DIST instead of NOT and then LPAREN makes the parse more complicated rather than simpler.

    2. No pattern matches a newline character (at least, in the original scanner). In the OP, this was [ \0\0], which seems very odd to me; there is no reason to repeat the \0, and in any event \0 hardly ever appears in the input, while newline characters are quite common. I suppose that you want the scan to terminate on whitespace; I would have preferred [[:space:]] { return END; } which would do just that. In the unlikely event that you really have a NUL character in the input stream, the default rule I added will handle it "correctly".

    3. This default rule returns the character code as the token number for any otherwise unmatched character (including newline, but see above.) That will work perfectly if LEFT_PAR, RIGHT_PAR and NOT have the values '(', ')', and '!', respectively. If you're using bison to parse, you can achieve this result by not using named tokens at all for single-character tokens; you can just put '(' in the production, without even declaring that it is a token.

      If all that doesn't fit with your parsing model, just use the rules you had in the original.

    4. The token matched by {arith_piece} (either a sequence of letters and numbers, or a single arithmetic operator) is just added to the accumulating token using yymore(). (See note 5 for an explanation of the trailing context.) We unconditionally switch to the <IN_ARITH> start condition, which does nothing if we are already in that start condition, so that we can output the accumulated token before outputting a logical operator. (See Note 6).

      flex flags this as "dangerous trailing context", but it will work fine; in this case, you can ignore the warning message.

    5. This rule can only match if the following one doesn't, and since the trailing context on the previous rule will match any following character, this rule can only match if the match is right at the end of the input. So in the action for this rule, we know that we are at EOF. This complicated game is necessary because yytext is not valid in an <<EOF>> rule, even if the previous token was retained with yymore().

    6. Anything not matched by {arith_piece} -- that is, anything which could be matched by a pattern in the <INITIAL> start condition -- needs to "send" the accumulated token and then be handled as it would be in that start condition. Without a push parser, though, it is impossible to send two tokens from a single scan action, so here we just send the accumulated arithmetic string, and switch to the <INITIAL> start condition. We use yyless to adjust the length of the accumulated string, effectively removing the token we just scanned; that will force it to be rescanned in the <INITIAL> start condition, which will send the corresponding token.
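    To make Note 6 concrete: after `yymore()`, `yytext` holds the accumulated symbol with the just-scanned operator glued onto the end, and `yyless(symbol_leng)` splits it back apart. Here is a minimal plain-C++ model of that split (the function name is invented for illustration; it is not a flex API):

    ```cpp
    #include <string>
    #include <utility>

    // Models the effect of yyless(symbol_leng) in the <IN_ARITH> rule: the
    // first component is what gets NUL-terminated and returned as SYMBOL;
    // the second is what yyless() pushes back into the input to be rescanned
    // in the <INITIAL> start condition.
    std::pair<std::string, std::string>
    split_accumulated(const std::string& yytext_buf, std::size_t symbol_leng) {
        return { yytext_buf.substr(0, symbol_leng), yytext_buf.substr(symbol_leng) };
    }
    ```

    Scanning `a<b+c*4<->d`, the `<IN_ARITH>` rule fires with the accumulated `yytext` equal to `a<b+c*4<->` and `symbol_leng == 7`, so `SYMBOL` carries `a<b+c*4` and the `<->` is rescanned in `<INITIAL>`, producing `IFF`.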