Search code examples
compiler-construction

Should every possible Ascii Character be added to a Transition Table for a Finite Automaton


I am trying to implement a Deterministic Finite Automaton to create a lexical analyzer for a project. When creating the transition table in C++, I created a map which uses the current state as the key and stores the possible transitions as another map in the value. Would I have to go through and add every possible transition for every possible input for all of the states, or is there a more efficient method for this task?


Solution

  • There are lots of optimisations available.

    If there is no transition with the current input character from the current state, then the lexer has found the match (but it might not be at the current input point, see below). So you have the option of storing a transition to a special "reject" state or of not storing a transition at all and using "reject" as the default lookup value. If you're using a map, then not storing non-transitions makes sense. But it's generally better to use a vector (or fixed-size array) which is a more efficient datatype, and use the character code as an integer in the index.

    Either way, you can almost always significantly cut down on table size by grouping characters into equivalence classes based on the set of transitions for that character. (That is, two characters are in the same equivalence class if every state either has the same transition on both characters, or has no transition on either character.) The equivalence classes can be represented by small integeers; in a typical lexical grammar, even with Unicode character sets, there are fifty or sixty equivalence classes, which is a lot fewer than 256 8-bit characters and enormously fewer than over a million Unicode characters.

    The transition table can be compressed a lot more; standard compression techniques are described in most relevant textbooks. But these days memory is cheap enough that the equivalence class compression is probably sufficient.

    The other issue is knowing what to do when you hit a state with no valid transition. There's no guarantee in general that the current state is an accepting state, so it's possible that you will have to back up the input cursor.

    For a simple example, consider a language like C in which . and ... are tokens but .. is not. As a result, the input:

    a.b       => VAR(a) '.' VAR(b)
    a..b      => VAR(a) '.' '.' VAR(b)
    a...b     => VAR(a) '...' VAR(b)
    

    If the input processed was .. and the next character is b, it's necessary to back the input up in order to accept only one point as a token.