Search code examples
algorithmoptimizationlevenshtein-distance

Finding closest neighbour using optimized Levenshtein Algorithm


I recently posted a question about optimizing the algorithm to compute the Levenshtein Distance, and the replies lead me to the Wikipedia article on Levenshtein Distance.

The article mentioned that if there is a bound k on the maximum distance a possible result can be from the given query, then the running time can be reduced from O(mn) to O(kn), m and n being the lengths of the strings. I looked up the algorithm, but I couldn't really figure out how to implement it. I was hoping to get some leads on that here.

The optimization is #4 under "Possible Improvements".

The part that confused me was the one that said that we only need to compute a diagonal stripe of width 2k+1, centered on the main diagonal (the main diagonal is defined as coordinates (i,i)).

If someone could offer some help/insight, I would really appreciate it. If needed, I can post the complete description of the algorithm in the book as an answer here.


Solution

  • I've done it a number of times. The way I do it is with a recursive depth-first tree-walk of the game tree of possible changes. There is a budget k of changes, that I use to prune the tree. With that routine in hand, first I run it with k=0, then k=1, then k=2 until I either get a hit or I don't want to go any higher.

    char* a = /* string 1 */;
    char* b = /* string 2 */;
    int na = strlen(a);
    int nb = strlen(b);
    bool walk(int ia, int ib, int k){
      /* if the budget is exhausted, prune the search */
      if (k < 0) return false;
      /* if at end of both strings we have a match */
      if (ia == na && ib == nb) return true;
      /* if the first characters match, continue walking with no reduction in budget */
      if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
      /* if the first characters don't match, assume there is a 1-character replacement */
      if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
      /* try assuming there is an extra character in a */
      if (ia < na && walk(ia+1, ib, k-1)) return true;
      /* try assuming there is an extra character in b */
      if (ib < nb && walk(ia, ib+1, k-1)) return true;
      /* if none of those worked, I give up */
      return false;
    }
    

    Added to explain trie-search:

    // definition of trie-node:
    struct TNode {
      TNode* pa[128]; // for each possible character, pointer to subnode
    };
    
    // simple trie-walk of a node
    // key is the input word, answer is the output word,
    // i is the character position, and hdis is the hamming distance.
    void walk(TNode* p, char key[], char answer[], int i, int hdis){
      // If this is the end of a word in the trie, it is marked as
      // having something non-null under the '\0' entry of the trie.
      if (p->pa[0] != null){
        if (key[i] == '\0') printf("answer = %s, hdis = %d\n", answer, hdis);
      }
      // for every actual subnode of the trie
      for(char c = 1; c < 128; c++){
        // if it is a real subnode
        if (p->pa[c] != null){
          // keep track of the answer word represented by the trie
          answer[i] = c; answer[i+1] = '\0';
          // and walk that subnode
          // If the answer disagrees with the key, increment the hamming distance
          walk(p->pa[c], key, answer, i+1, (answer[i]==key[i] ? hdis : hdis+1));
        }
      }
    }
    // Note: you have to edit this to handle short keys.
    // Simplest is to just append a lot of '\0' bytes to the key.
    

    Now, to limit it to a budget, just refuse to descend if hdis is too large.