Search code examples
parsingtokentokenizeformal-languages

context sensitive tokenization of code


I am working on a parser for a language that has

  • identifiers (say, a letter followed by a number of alphanumeric characters or an underscore),

  • integers (any number of digits and possibly carets ^),

  • some operators,

  • filename (a number of alphanumeric characters and possibly slashes, and dots)

Apparently filename overlaps integers and identifiers, so in general I cannot decide if I have a filename or, say, an identifier unless the filename contains a slash or a dot.

But filename can only follow a specific operator.

My question is how this situation is usually handled during tokenization? I have a table driven tokenizer (lexer), but I am not sure how to tell a filename from either an integer or an identifier. How is this done?

If filename was a superset of integers and identifiers then I probably could have grammar productions that could handle that, but the tokens overlap...


Solution

  • Flex and other lexers have the concept of start conditions. Essentially the lexer is a state machine and its exact behaviour will depend on its current state.

    In your example, when your lexer encounters the operator preceding a filename it should switch to a FilenameMode state (or whatever) and then switch back once it has produced the filename token it was expecting.

    EDIT:

    Just to give some concrete code this side of the hyperlink:

    You would trigger your FILENAME_MODE when you encounter the operator...

    {FILENAME_PREFIX} { BEGIN(FILENAME_MODE); }
    

    You would define your rule to parse a filename:

    <FILENAME_MODE>{FILENAME_CHARS}+ { BEGIN(INITIAL); }
    

    ...switching back to the INITIAL state in the action.