Search code examples
algorithmsubstringautomaton

How to understand the process of DFA construction in KMP algorithms


I'm learning KMP algorithm in the Book Algorithms 4th. I could understand most of the algorithm but have been stuck at process of dfa construction for a couple of days.

Take the pattern ABABAC for example. When there's a mismatch at C(state of dfa is 5), we should shift right a character in the text. So the pattern characters that we have known are BABA. However, how to figure out the next state of the dfa during construction? I failed to understand the text below:

For example, to decide what the DFA should do when we have a mismatch at j=5, for ABABAC, we use the DFA to learn that a full backup would leave us in state 3 for BABA, so we can copy dfa[][3] to dfa[][5].

What does it mean by "a full backup would leave us in state 3 for BABA", and how to get this conclusion while there's no specified input? And I can't understand the graph left to the text. Could anyone explain what it means? I have tried to understand it by myself for a couple days, but still could not get it. Thank you!

And you can read the segment of Algorithms 4th here.

dfa construction


Solution

  • When you're matching the input string, you can only get into state 5 after matching the first 5 characters of the pattern, and the first 5 characters of the pattern are ABABA. So no matter which input string you use, you know that the text preceding state 5 is "ABABA".

    So you if you get a mismatch in state 5, you could back up 4 characters and try matching again. But since you know what text has to appear before state 5, you don't actually need the input text to figure out what would happen. You can figure out beforehand what state you'd end up in when you got back to the same place.

    backup 4 characters and go to state 0:

    0 : BABA

    A doesn't match, so advance and go to state 0

    0: ABA

    A matches, so go to state 1

    1: BA

    B matches, go to state 2

    2: A

    A matches, go to state 3

    3:

    now we're back to the place in the input where we saw state 5 before, but now we're in state 3.

    This will always happen when we get a mismatch in state 5, so instead of actually doing this, we just make a note that says "when we get a mismatch in state 5, go to state 3".

    Note that most KMP implementations will actually make a failure table where failure_table[5]=3. Your example implementation is building the full DFA[char][state] instead, so it copies all the transitions from state 3 into state 5 for the failure cases. That says "when we get a mismatch in state 5, do whatever state 3 does", which works out the same.

    UNDERSTAND EVERYTHING ABOVE BEFORE MOVING ON

    Now let's speed up the calculation of those failure states...

    When we get a mismatch in state 5, we can use the DFA we have so far to figure out what would happen if we backed up and rescanned the input starting at the next possible match, by applying the DFA to "BABA". We end up in state 3, so let's call state 3 the "failure state" for state 5.

    It looks like we have to process 4 pattern charcters to calculate the failure state for state 5, but we already did most of that work when we calculated the failure state for state 4 -- we applied the DFA to "BAB" and ended up in state 2.

    So to figure out the failure state for state 5, we just start in the failure state for state 4 (state 2), and process the next character in the pattern -- the "A" that came after state 4 in the input.