I want to build a compiler from scratch, so as a first step I want to build a scanner. But I am stuck on the question of how I should treat relational operators like `>=`, `==`, `<=`. Should I simply treat them as `>` then `=`, `=` then `=`, `<` then `=`, or as a whole token? By the way, how about `++` and `--`? Thanks for the help!
Scanning algorithms are usually designed to match the longest possible token, so `>=`, `==`, and `++` each come out as a single token rather than two. In order to do this, they employ a "maximal munch" strategy: whenever a dead end is encountered, the scanner backtracks to the last accepted position, yields that token, and continues by resetting the state.
The simplest implementation strategy for a lexer is to construct a DFA that models the union of every token's regex. So if your language's tokens consisted of `=` or `===` (but not `==`) delimited by whitespace (which I'll model explicitly for this answer as `_`), then your lexer would basically be using the state of a DFA that accepts `=|===|__*` (notice that the rule for whitespace is "at least one `_`", i.e. `_+`).
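To make that concrete, here's one way the states and transitions of that DFA could be laid out in C. This is a sketch under my own naming (`S_EQ1`, `transitions`, and so on are illustrative, not anything standard):

```c
/* Sketch of the DFA for =|===|_+ over the two-character alphabet {'=', '_'}. */
enum State {
    S_START,  /* initial state                        */
    S_EQ1,    /* seen "=":   accepting (token "=")    */
    S_EQ2,    /* seen "==":  NOT accepting            */
    S_EQ3,    /* seen "===": accepting (token "===")  */
    S_WS,     /* seen "_"+:  accepting (whitespace)   */
    S_ERROR   /* dead state: no valid transition      */
};

/* Transition table: rows are states, columns are the inputs '=' and '_'. */
static const enum State transitions[5][2] = {
    /*            '='       '_'     */
    /* S_START */ { S_EQ1,   S_WS    },
    /* S_EQ1   */ { S_EQ2,   S_ERROR },
    /* S_EQ2   */ { S_EQ3,   S_ERROR },
    /* S_EQ3   */ { S_ERROR, S_ERROR },
    /* S_WS    */ { S_ERROR, S_WS    },
};

/* Accept states, indexed by the enum above. */
static const int is_accepting[5] = { 0, 1, 0, 1, 1 };
```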
As you can see, there's an intermediary state (which you'd reach on input `==` but not accept, as it's not an accept state). So, assuming you've got your DFA computed (in practice, there are shortcuts for compacting ranges and other, generalised, predicates on each input character), you can begin to write a state machine.
DFAs can be written as a loop that, for each state (using a `switch` statement with a `case` per state), scrutinises the current state against the current character in the source buffer. If a transition is valid - i.e. the current state with the current input character exists as an outgoing arrow in the DFA - you set the current state to that state (and keep track of whether it's accepting or not); see the sketch after this paragraph.
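For instance, the transition table above could be re-expressed in that switch-per-state style like this (again a sketch; `step` is just an illustrative name):

```c
/* One transition of the DFA, written with a case per state. */
enum State step(enum State s, char c) {
    switch (s) {
        case S_START: return c == '=' ? S_EQ1 : (c == '_' ? S_WS : S_ERROR);
        case S_EQ1:   return c == '=' ? S_EQ2 : S_ERROR;
        case S_EQ2:   return c == '=' ? S_EQ3 : S_ERROR;
        case S_EQ3:   return S_ERROR;  /* "===" can't be extended */
        case S_WS:    return c == '_' ? S_WS : S_ERROR;
        default:      return S_ERROR;
    }
}
```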
In order to make this work in a maximal munch fashion, you maintain two offsets into the input buffer, `previous` and `current`. Initially, these are both 0. You advance `current` and, during each transition, keep track of the offset at which you last reached an accept state. The idea is that if, at any point, no valid transition exists, you can backtrack/rewind to that last accepting offset. So, on a dead end, you rewind, yield a token, update `previous` to sit just past the lexeme you yielded (i.e. at the last accepting offset), and reset the state to the initial state. A sketch of this driver loop is below.
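Here's a minimal sketch of that driver loop, assuming the `step` and `is_accepting` definitions from above (a real lexer would yield token values rather than print, and would attach a token kind to each accept state):

```c
#include <stdio.h>
#include <stddef.h>

/* Maximal munch: repeatedly find the longest accepted lexeme starting
 * at `previous`, yield it, and restart the DFA just past it. */
void lex(const char *input, size_t length) {
    size_t previous = 0;                      /* start of current lexeme */
    while (previous < length) {
        enum State state = S_START;
        size_t current = previous;            /* scanning offset         */
        size_t last_accept = previous;        /* end of longest match    */
        int seen_accept = 0;

        while (current < length) {
            enum State next = step(state, input[current]);
            if (next == S_ERROR)
                break;                        /* dead end: stop and rewind */
            state = next;
            current++;
            if (is_accepting[state]) {        /* remember the rewind point */
                last_accept = current;
                seen_accept = 1;
            }
        }

        if (!seen_accept) {                   /* nothing matches here */
            fprintf(stderr, "lex error at offset %zu\n", previous);
            return;
        }
        /* Yield input[previous .. last_accept) as one token. */
        printf("token: %.*s\n", (int)(last_accept - previous), input + previous);
        previous = last_accept;               /* restart just past the lexeme */
    }
}
```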
Suppose we're lexing `===____=`. The lexer would be in the initial state, see `=`, and then move into an intermediary (accepting) state. From there, it would see the next `=` and move into the next (non-accepting) state. Then, it'd see the final `=` and move into the final accept state on that chain. Once it has seen `===`, it will notice the start of a new lexeme, because there's no valid transition from the accept state for `===` on `_`. So it would yield `===` (by taking the substring of length `current - previous`, starting from `previous`), reset the machine, and then start examining the sequence of `_`s from the initial state. A similar story would happen to yield `____` and `=` (plus a special `EOF` token, which all lexers tend to have, once the input is exhausted).
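You can check this trace against the sketch above; running the hypothetical `lex` driver on that input yields exactly those three lexemes:

```c
#include <string.h>

int main(void) {
    const char *src = "===____=";
    lex(src, strlen(src));
    /* Prints:
     *   token: ===
     *   token: ____
     *   token: =
     */
    return 0;
}
```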
What I've described here is a simplistic form of the maximal munch algorithm. It's simplistic because the rewinding never needs to span more than one token. There are descriptions of situations where you may wish to backtrack further, and where this naive algorithm (which performs only limited book-keeping) has worst-case O(n^2) time complexity on cleverly-constructed inputs: for example, with token rules `a` and `a*b`, an input of n `a`s and no `b` forces the lexer to scan to the end of the input, rewind, yield a single `a`, and then repeat that whole scan from the next character.
Long story short, you differentiate between tokens that are prefixes of other tokens (`=` versus `===`, or `+` versus `++`) by ensuring that your lexer recognises both but chooses to consume the largest possible (accepting) input. You can experiment with this algorithm by following it through on the DFA and maintaining the two offsets into the input you're lexing. You can read more about maximal munch here. Another helpful resource is a paper that describes how to make this algorithm linear time, "Maximal-Munch" Tokenization in Linear Time - the algorithm I've described above, alongside its improvement, is on page 5, albeit described a bit differently.