Search code examples
yacclexambiguous

Parse same pattern differently depending on context with lex/yacc


My problem is that I have an identical pattern of characters that I wish to parse differently given their context. In one part of the file I need to parse a version number of the form "#.#" Obviously, this is identical to a floating point number. The lexer always opts to return a floating point number. I think I can switch the sequence of rules to give the version # precedence(?), but that doesn't do me any good when I need to later parse floating point numbers.

I suppose I could forget about asking the parser to return each piece of the version separately and split the floating point number into pieces, but I had hoped to have it done for me.

There is actually a bit more to the context of the version #. Its complete form is "ABC #.# XYZ" where "ABC" and "XYZ" never change. I have tried a couple of things to leverage the context of the version #, but haven't made it work.

Is there a way to provide some context to the lexer to parse the components of the version? Am I stuck with receiving a floating point number and parsing it myself?


Solution

  • You have a few possibilities.

    The simplest one is to do the string to number conversion in the parser instead of the scanner. That requires making a copy of the number as a string, but the overhead should not be too high: malloc of short strings is well-optimized on almost all platforms. And the benefit is that the code is extremely straightforward and robust:

    Parser

    %union {
          char*  string;
          double number;
          // other types, including version
    }
    
    %token <string> DOTTED
    %token <number> NUMBER
    %type <number> number
    %type <version> version
    %%
    number : NUMBER
           | DOTTED { $$ = atod($1); free($1); }
    version: DOTTED { $$ = make_version($1); free($1); }
    

    Scanner

    [[:digit:]]+\.[[:digit:]]+     { yylval.string = strdup(yytext); return DOTTED; }
    [[:digit:]]+\.?|\.[[:digit:]]+ { yylval.number = atod(yytext); }
    

    The above assumes that version numbers are always single-dotted, as in the OP. In applications where version numbers could have multiple dots or non-numeric characters, you would end up with three possible token types: unambiguous numbers, unambiguous version strings, and single-dotted numeric strings which could be either. Aside from adding the VERSION token type and the pattern for unambiguous version strings to the scanner, the only change is to add | VERSION to the version production in the parser.

    Another possibility, if you can easily figure out in the scanner whether a number or a version is required, is to use a start condition. You can also change the condition from the parser but it's more complicated and you need to understand the parsing algorithm in order to ensure that you are communicating the state changes correctly.

    Finally, you could do both conversions in the scanner, and pick the correct one when you reduce the token in the parser. If the version is a small and simple data structure, this might well turn out to be optimal.