Search code examples
compiler-constructiongrammarlexerlexical-analysisformal-languages

Table-Driven Lexers - What about reserved keywords?


This question stems from another question I have asked over on the CS site. Reference

I have tried searching through online course notes from various Universities so that I can find the answer to a problem I am facing.

I have to implement a compiler for a custom language for an assignment. This language contains some atomic symbols like letters from the English alphabet, and digits. And I managed to find examples for these, and they're quite straightforward. For example: Go to Page 25

However, this language also contains reserved words such as if and for.

This is where I'm having problems. Suppose the lexer is trying to read the string "if (expression) statement". If I use an implementation like Page 4 it will incorrectly categorise if as an identifier.

So my idea is to implement a "lookahead" mechanism so that before the lexer categorises and sends to the DFA what is being read, it can make an informed, correct decision.

For example: The lexer encounters i. Since i can belong to a reserved word (if), the lexer should check for the next character. If it is f then the lexer should make sure it's not actually a normal string that happens to start with if, like ifxyz.

I like this idea, except I haven't managed to find anything similar from looking at course notes online, which makes me think that perhaps I'm doing something wrong.

UPDATE!! This is for those who got here through a search trying to find a solution. It's been a while, I've actually solved the issue and the answer linked in the comments is very helpful. I suggest you go read it.

Here's how I ended up solving this:


START(f) -> F

F(o) -> FO

FO(r) -> FOR

FOR(_) -> IDENTIFIER

Furthermore, all states have a "Lex As" property. Reason for this: Consider you arrive at state F with no further input. Therefore, you should assume it's an identifier (in most languages). Hence, F.lexAs would return the correct interpretation of the state, in this case, IDENTIFIER.


Solution

  • Your lookahead example really is just like a DFA in itself. Sadly, there is no easy way to solve this problem besides hard coding the keywords into the DFA you are using.

    For the if example, I would make a token type called IF which is different from your ID token type.

    Now, you must change your DFA to accept an IF token. If we are at the starting state and we read an i the DFA should not start the normal ID path, it should go down a separate path.

    Here's an example DFA for interpreting just the IF and ID tokens, and accepting only characters a-z.

    Keyword DFA