Implicit termination of start conditions in flex without using `unput()`


Let's say I'm parsing hexadecimal numbers in flex. I have something like this:

%x hexnumber
%%
"0x"                { BEGIN hexnumber; }
<hexnumber>[0-9A-F] { process_digit(); }

This works fine; the 0x prefix starts hex-parsing mode, and then each digit is processed in turn.

The problem is that a hex constant doesn't have an explicit terminator token. So, how do I switch back to the INITIAL state? By the time I know that the next character isn't part of the numeric constant, it's been consumed.

I can always push it back onto the input stream with unput():

<hexnumber>.        { unput(*yytext); BEGIN INITIAL; }

...but I'd very much prefer not to do this: for implementation reasons beyond the scope of this question, unput() is very expensive for me.

I know that the generated state machine is capable of automatically switching back to the INITIAL state without consuming the next character, because otherwise rules like [0-9A-F]+ wouldn't work. Is there a way to achieve this using explicit start conditions?


Solution

  • Use yyless(0) instead of unput(*yytext). yyless is essentially free, since it only adjusts a couple of pointers; it makes no attempt to reallocate or move the input buffer. (You still need BEGIN(INITIAL), of course.)

    A much messier solution would be to use trailing context to distinguish a hex character that is followed by another hex character from one that is not:

     <hexnumber>[[:xdigit:]]/[[:xdigit:]]    { process_digit(); }
     <hexnumber>[[:xdigit:]]                 { process_digit(); BEGIN(INITIAL); }


    But that is a lot less flexible.
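
For reference, here is a minimal sketch of how the yyless(0) suggestion might look in a complete scanner. It assumes the start condition and rule layout from the question; the body of process_digit(), the catch-all INITIAL rule, and main() are hypothetical placeholders added only so the file builds on its own.

```lex
%{
#include <stdio.h>

/* Hypothetical stand-in for the question's process_digit(). */
static void process_digit(void);
%}

%option noyywrap nounput noinput
%x hexnumber

%%
"0x"                  { BEGIN(hexnumber); }
<hexnumber>[0-9A-F]   { process_digit(); }
<hexnumber>.|\n       { /* Not a hex digit: hand the character back and leave the state. */
                        yyless(0);
                        BEGIN(INITIAL);
                      }
.|\n                  { /* Placeholder for the ordinary INITIAL rules. */ }
%%

/* Placeholder: just reports the digit that was matched. */
static void process_digit(void)
{
    printf("hex digit: %c\n", yytext[0]);
}

int main(void)
{
    yylex();
    return 0;
}
```

The order inside the terminating action does not matter, but both statements are needed: yyless(0) alone would rescan the same character in the hexnumber state and match the same rule again, looping forever. After BEGIN(INITIAL), the returned character is matched by whichever INITIAL rule applies to it, so INITIAL should have a rule for every character that can follow a hex constant.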