Search code examples
crecursionbisonflex-lexer

How to decide when change the state if there is recursion in Bison rule?


I want to handle this string: (ads AND qwe) OR zxc ... <KEYWORD> I have a start condition in Flex, which can catch 'any name', OR, AND and 'any word' (to catch invalid words). <KEYWORD> can be: SLEEP, WIDTH etc.

Bison rule is:

stmt:
     one_rule { }
   | one_rule AND stmt { }
   | one_rule OR stmt { }
   | '(' stmt ')' { }

In my case, KEYWORD also caught with 'any name' regex as I don't change the state, but if I change the state in one of the regexes, only the first part will work ((ads AND qwe))



UPDATED: Added Detailed Example

Valid inputs:

1. ((ads AND qwe) OR zxc) SLEEP 
2. ads AND qwe AND zxc WIDTH 
3. ads AND qwe

Invalid inputs:

1. ((ads AND qwe) OR zxc) BAD_KEYWORD 
2. ads AND qwe AND zxc AND
3. ads AND qwe OR

FLEX

<A>AND {
   position += yyleng;
   return AND;
}
<A>OR {
   position += yyleng;
   return OR;
}
<A>{NAME} {
   position += yyleng;
   yylval.str = strdup(yytext);
   return NAME;
}
<A>{ANY_TEXT} {
   position += yyleng;
   yylval.str = strdup(yytext);
   return ANY_TEXT;
}
<B>WIDTH {
   position += yyleng;
   return WIDTH;
}
<B>SLEEP {
   position += yyleng;
   return SLEEP;
}

In my implementation keyword is caught around A state. I need to understand the point where I can change the state to B


Solution

  • It seems like your intention is to allow certain words (such as SLEEP or WIDTH) to be used either as keywords or as search terms, depending on where they appear. Other words, such as AND and OR, are always special, so they will create syntax errors if they appear in a place where they shouldn't.

    If the above is correct, then the grammar should recognise:

    word AND other        -- statement with two rules
    word AND SLEEP        -- also a statement with two rules
    word AND other SLEEP  -- statement with two rules and a modifier ("keyword")
    

    However, the following are not legal:

    word AND AND          -- AND cannot be a rule
    word AND other AND    -- AND cannot be a modifier
    

    Thus, the keywords SLEEP and WIDTH are what are often called "contextual" or "semireserved" words.

    This language style does indeed create some difficulties for Bison/Flex based grammars, as anyone who has attempted to write an SQL parser will know. The parser generator Lemon, whose primary use case is to generate the Sqlite parser, implements a special %fallback declaration for this specific case. However, neither Yacc nor Bison (nor any other parser generator I know of offhand) has a similar facility, so that it needs to be implemented manually in the grammar.

    While it is theoretically possible to put the logic into the lexical analyser, as you seem to be contemplating, experience shows that such solutions are hard to write, debug and maintain, largely because they end up reproducing a lot of the parser logic in the lexical analyser. This duplication of logic is fragile; changes in the grammar may need to be accompanied by complicated rearrangement of the logic in the lexical analyser, and if the two components fall out of sync, valid sentences can be mysteriously rejected.

    The alternative of putting all of the logic in the parser is not without its problems either, but it's normally easier to maintain and debug. The basic structure is to always recognise the keywords as such in the lexical analysis (so start conditions are not necessary), and then fold the keywords back into the "identifier" category in the parser.

    Here's a simple example, based on the above theory of what your grammar looks like: (I left out "ANY_TEXT" because there is not even a hint what it's purpose might be.)

    %left OR
    %left AND
    %token SLEEP WIDTH
    %%
    expr: name
        | '(' expr ')'
        | expr AND expr
        | expr OR expr
    stmt: expr
        | expr WIDTH
        | expr SLEEP
    name: NAME
        | WIDTH { $$ = strdup("WIDTH"); }
        | SLEEP { $$ = strdup("SLEEP"); }
    

    The last set of productions might require some explanation. The first one simply allows a NAME to be used as a name, as is logical. The other two fold the contextual keywords WIDTH and SLEEP back into name in contexts where they cannot be used as keywords. (Care must be taken to not use name in a production where only NAME would be legal. That doesn't happen in this grammar, but in the general case it can require some careful thought. If you get it wrong, you will most likely get a parsing conflict error, and you can then use the -v report to help you debug the grammar.)

    The semantic action in the case of the conversion of the keywords into name is intended to maintain the invariant that the semantic value of a name is always a dynamically allocated string. Thus, it duplicates the lexical analyser's action for NAMEs, which is to create a dynamically allocated copy of the token. If the semantic value of SLEEP used as a name were a string literal, then you would not be able to simply call free() on the semantic value of name when you no longer needed the string. On the other hand, if you did the strdup in the lexical analyser, then you would need to remember to explicitly free() the semantic value of SLEEP and other keyword tokens in any grammar production in which they are not going to be turned into name. The style shown above is not common -- most people seem to prefer to jump through various hoops in order to avoid what looks like an unnecessary strdup -- but it is by far the easiest style to work with, and a few extra strdup() calls are not going to be noticeable during your parse (as long as you do, in fact, free() all dynamically allocated strings when you no longer need them).