I implemented a Levenshtein trie to find words similar to a given word. My goal was to have a fast way to do spell correction.
However, I found out that there is an even faster way to do that:
Levenshtein Automata
I just have a problem... I don't understand a word of what's written there. Can someone explain the idea and the basic functionality of a Levenshtein automaton in simple words?
A deterministic finite automaton (DFA) is a finite set of states together with a transition function: for each state and each input character, the transition function names the state to move to next. One state is designated the start state, and some states are accepting states.
You can draw a DFA as a diagram like those in the paper. Conventionally, circular nodes are states; directed edges, each labeled with one character, are transitions. Accepting states are marked as double-line circles, and the start state has an inward-pointing arrow with nothing at its tail.
A DFA accepts word W if and only if you can move a marker from the start state along transitions whose concatenated labels equal W to an accepting state. That is, if T is the transition function, and W = "cat", then T(T(T(Start, 'c'), 'a'), 't') must be an accepting state. Cycles in the transition function allow DFAs to accept strings of arbitrary length even though the DFA is finite.
In software, a DFA is a simple loop and a table T(state, char) that implements the transition function:

current_state = START
while not end-of-input
    c = get character from input
    current_state = T(current_state, c)
end
if current_state is an accepting state return ACCEPT, else REJECT
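For concreteness, here is that loop as a small Python sketch. The particular states and table are made up for illustration: this toy DFA accepts exactly the word "cat", matching the T(T(T(Start, 'c'), 'a'), 't') example above.

# Toy DFA that accepts exactly the word "cat". Entries missing from
# the table T send us to a dead state that can never reach acceptance.
START, C, CA, CAT, DEAD = range(5)
T = {(START, 'c'): C, (C, 'a'): CA, (CA, 't'): CAT}
ACCEPTING = {CAT}

def run_dfa(word):
    state = START
    for ch in word:
        state = T.get((state, ch), DEAD)
    return state in ACCEPTING

print(run_dfa("cat"))   # True
print(run_dfa("cart"))  # False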
The Wikipedia page on DFAs is not bad.
DFAs have nice properties. Accepting or rejecting an input of length N requires O(N) time (as long as the transition function runs in constant time). Every DFA has a unique minimum-state equivalent (among all DFAs accepting the same set of words), and there is an efficient algorithm to find it. It's also easy to check two DFAs for equivalence in time linear in their size.
A Levenshtein Automaton L(W, d) for word W and Levenshtein distance d is just a DFA that accepts all words having Levenshtein distance at most d from W. That is, the automaton accepts W plus a bunch of other words that are W with no more than d "mistakes" in the usual sense of Levenshtein distance.
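To make the definition concrete, here is a brute-force membership test in Python. It is not an automaton at all, just the textbook dynamic-programming recurrence for Levenshtein distance (the function name is mine); it simply spells out the language that L(W, d) accepts.

# Reference check: is the Levenshtein distance between w and word <= d?
# Defines the set of words L(w, d) accepts, the slow way.
def within_distance(w, word, d):
    row = list(range(len(w) + 1))  # distances against the empty prefix
    for i, c in enumerate(word, 1):
        new = [i]
        for j in range(1, len(w) + 1):
            cost = 0 if w[j - 1] == c else 1
            new.append(min(new[j - 1] + 1,        # insertion
                           row[j] + 1,            # deletion
                           row[j - 1] + cost))    # match/substitution
        row = new
    return row[-1] <= d

print(within_distance("cat", "cart", 1))  # True: one insertion
print(within_distance("cat", "dog", 1))   # False: three edits needed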
The contribution of the paper is a fast algorithm for computing Levenshtein DFAs and then a more advanced algorithm that accomplishes the same thing without computing the DFA explicitly.
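To give a feel for that second idea, here is a rough Python sketch (my own, not the paper's algorithm) that decides membership in L(W, d) without ever building a DFA. It simulates the nondeterministic automaton directly, tracking the set of live states (i, e), meaning "i characters of W matched with e errors spent". Computing the DFA explicitly would amount to precomputing these state-set transitions up front.

# Sketch: membership in L(W, d) by simulating the NFA state sets.
def lev_nfa_accepts(W, d, word):
    def closure(states):
        # Deleting W[i] consumes no input: (i, e) -> (i + 1, e + 1).
        todo, seen = list(states), set(states)
        while todo:
            i, e = todo.pop()
            if i < len(W) and e < d and (i + 1, e + 1) not in seen:
                seen.add((i + 1, e + 1))
                todo.append((i + 1, e + 1))
        return seen

    states = closure({(0, 0)})
    for c in word:
        nxt = set()
        for i, e in states:
            if i < len(W) and W[i] == c:
                nxt.add((i + 1, e))              # c matches W[i]
            if e < d:
                nxt.add((i, e + 1))              # c was inserted
                if i < len(W):
                    nxt.add((i + 1, e + 1))      # c substitutes W[i]
        states = closure(nxt)
    return any(i == len(W) for i, e in states)

print(lev_nfa_accepts("cat", 1, "cst"))  # True: one substitution
print(lev_nfa_accepts("cat", 1, "c"))    # False: two deletions needed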