compiler-construction state-machine lexical-analysis

Literal extraction policy for a lexical analyzer

I have built a lexical analyzer for a C-like language. For example, given the input below, it produces the following result.

Input

int i = 0 ; int j = i + 3;

Output

int    KEYWORD
i      IDENTIFIER
=      OPERATOR
;      PUNCTUATION
int    KEYWORD
j      IDENTIFIER
=      OPERATOR
i      IDENTIFIER
+      OPERATOR
3      INTEGER_CONSTANT
;      PUNCTUATION

As you may have noticed, the input in the example above is syntactically correct. However, when I give it something like the input below, it fails.

Input

int i = "1.2.2222.+\<++++

I have made a class whose sole purpose is to break the above string into small parts (I call them literals; I don't know if that is the correct term) that can be matched with a regex or validated with a DFA.

The problem arises in ambiguous situations such as +, which can be an addition operator, part of an upcoming integer literal, or part of an increment operator. My teacher's requirement is explained in the next paragraph.

If a + is preceded by a +, it should be processed as an increment operator. In simple words, the program must try every possibility and choose the best one. That means that if the program has some valid input, then some invalid input, then again some valid input, it should not stop at the invalid input but should instead keep finding the correct literals. I am against this, though. My argument is that if a program string becomes invalid at a certain index, processing should stop there, because we are not writing an error-checking system after all.

I have tried to code all the possibilities using a complex (for me) nested if-else structure and have had partial success. Can any of you suggest a simpler and more elegant solution? I have also thought of structuring this problem as a state machine, but I am not too sure, because I have never implemented a state machine before other than a DFA that just answers yes or no for pattern matching.

As you can see, this is a homework question, but I am not asking for just code.

Solution

The usual approach to lexical analysis is to use the "maximal munch" algorithm: the input stream is divided into tokens by repeatedly taking the longest prefix which could be a single token. See this answer for one algorithm.

It is occasionally necessary to make exceptions to this rule (in C++, for example, <:: is normally lexed as < followed by ::, rather than as the digraph <: followed by :), but on the whole, the maximal munch rule is easy to implement and, more importantly, easy to read.
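
To make the maximal munch rule concrete, here is a minimal sketch in Python. It is not the algorithm from the linked answer, just one way to do it under some assumptions: the keyword list, the operator set, and the FLOAT_CONSTANT, STRING_CONSTANT and ERROR token names are made up for illustration (only KEYWORD, IDENTIFIER, OPERATOR, INTEGER_CONSTANT and PUNCTUATION come from the question). At each position it tries every pattern and keeps the longest match. If nothing matches, it emits a one-character ERROR token and carries on, which is roughly the "keep going after invalid input" behaviour your teacher asks for.

```python
import re

# Hypothetical token table for a small C-like language (an assumption, not
# taken from the question). Order only breaks ties between equally long
# matches, so KEYWORD is listed before IDENTIFIER.
TOKEN_PATTERNS = [
    ("KEYWORD", re.compile(r"(?:int|float|char|if|else|while|return)\b")),
    ("IDENTIFIER", re.compile(r"[A-Za-z_]\w*")),
    ("FLOAT_CONSTANT", re.compile(r"\d+\.\d+")),
    ("INTEGER_CONSTANT", re.compile(r"\d+")),
    ("STRING_CONSTANT", re.compile(r'"(?:[^"\\\n]|\\.)*"')),
    ("OPERATOR", re.compile(r"\+\+|--|==|!=|<=|>=|[-+*/<>=]")),
    ("PUNCTUATION", re.compile(r"[;,(){}]")),
]


def tokenize(source):
    """Split source into (lexeme, kind) pairs using maximal munch."""
    tokens = []
    pos = 0
    while pos < len(source):
        # Whitespace separates tokens but is not a token itself.
        if source[pos].isspace():
            pos += 1
            continue

        # Try every pattern at this position and remember the longest match.
        best_kind, best_end = None, pos
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(source, pos)
            if match and match.end() > best_end:
                best_kind, best_end = kind, match.end()

        if best_kind is None:
            # No rule applies here: report one bad character and resume
            # scanning at the next one instead of stopping.
            tokens.append((source[pos], "ERROR"))
            pos += 1
        else:
            tokens.append((source[pos:best_end], best_kind))
            pos = best_end
    return tokens


if __name__ == "__main__":
    for lexeme, kind in tokenize(r'int i = "1.2.2222.+\<++++'):
        print(f"{lexeme:<6} {kind}")
```

With this table, +++ comes out as ++ followed by +, because ++ is the longest operator that matches at that point. The sketch never treats + as part of an integer literal. It leaves the sign to the parser as a unary operator, which is a design choice on my part rather than something the question requires.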