Search code examples
performanceclojurelevenshtein-distance

Efficient implementation of Damerau-Levenshtein distance


I'm trying to implement really efficient Clojure function to compute Damerau-Levenshtein distance. I've decided to use this algorithm (attached source should be C++) for computing Levenshtein distance and add some lines to make it work for DLD.

Here is what I've created in Common Lisp (I hope it could help):

(defun damerau-levenshtein (x y)
  (declare (type string x y)
           #.*std-opts*)
  (let* ((x-len (length x))
         (y-len (length y))
         (v0 (apply #'vector (mapa-b #'identity 0 y-len)))
         (v1 (make-array (1+ y-len) :element-type 'integer))
         (v* (make-array (1+ y-len) :element-type 'integer)))
    (do ((i 0 (1+ i)))
        ((= i x-len) (aref v0 y-len))
      (setf (aref v1 0) (1+ i))
      (do ((j 0 (1+ j)))
          ((= j y-len))
        (let* ((x-i (char x i))
               (y-j (char y j))
               (cost (if (char-equal x-i y-j) 0 1)))
          (setf (aref v1 (1+ j)) (min (1+ (aref v1 j))
                                      (1+ (aref v0 (1+ j)))
                                      (+  (aref v0 j) cost)))
          (when (and (plusp i) (plusp j))
            (let ((x-i-1 (char x (1- i)))
                  (y-j-1 (char y (1- j)))
                  (val (+ (aref v* (1- j)) cost)))
              (when (and (char-equal x-i y-j-1)
                         (char-equal x-i-1 y-j)
                         (< val (aref v1 (1+ j))))
                (setf (aref v1 (1+ j)) val))))))
      (rotatef v* v0 v1))))

Now, I fear I cannot translate it into really efficient and idiomatic Clojure code (in functional style?). I would really appreciate any suggestion and I think it may be quite useful for many future readers too.

P.S. I've found this implementation, but I doubt if it is efficient and it uses some obsolete contrib functions (deep-merge-with and bool-to-binary):

(defn damerau-levenshtein-distance
  [a b]
  (let [m (count a)
        n (count b)
        init (apply deep-merge-with (fn [a b] b)
                    (concat
                     ;;deletion
                     (for [i (range 0 (+ 1 m))]
                       {i {0 i}})
                     ;;insertion
                     (for [j (range 0 (+ 1 n))]
                       {0 {j j}})))
        table (reduce
               (fn [d [i j]]
                 (deep-merge-with
                  (fn [a b] b)
                  d
                  (let [cost (bool-to-binary (not (= (nth a (- i 1))
                                          (nth b (- j 1)))))
                        x
                          (min
                           (+ ((d (- i 1))
                               j) 1) ;;deletion
                           (+ ((d i)
                               (- j 1)) 1) ;;insertion
                           (+ ((d (- i 1))
                               (- j 1)) cost)) ;;substitution))
                        val (if (and (> i 1)
                               (> j 1)
                               (= (nth a (- i 1))
                                  (nth b (- j 2)))
                               (= (nth a (- i 2))
                                  (nth b (- j 1))))
                        (min x (+ ((d (- i 2))
                                   (- j 2)) ;;transposition
                                  cost))
                        x)]
                    {i {j val}})))
               init
               (for [j (range 1 (+ 1 n))
                     i (range 1 (+ 1 m))] [i j]))]
    ((table m) n)))

Solution

  • OK, this should do the trick (based on KIMA's answer):

    (defn da-lev [str1 str2]
      (let [l1 (count str1)
            l2 (count str2)
            mx (new-matrix :ndarray (inc l1) (inc l2))]
       (mset! mx 0 0 0)
       (dotimes [i l1]
         (mset! mx (inc i) 0 (inc i)))
       (dotimes [j l2]
         (mset! mx 0 (inc j) (inc j)))
       (dotimes [i l1]
         (dotimes [j l2]
           (let [i+ (inc i) j+ (inc j)
                 i- (dec i) j- (dec j)
                 cost (if (= (.charAt str1 i)
                             (.charAt str2 j))
                        0 1)]
             (mset! mx i+ j+
                    (min (inc (mget mx i j+))
                         (inc (mget mx i+ j))
                         (+ (mget mx i j) cost)))
             (if (and (pos? i) (pos? j)
                      (= (.charAt str1 i)
                         (.charAt str2 j-))
                      (= (.charAt str1 i-)
                         (.charAt str2 j)))
               (mset! mx i+ j+
                      (min (mget mx i+ j+)
                           (+ (mget mx i- j-) cost)))))))
       (mget mx l1 l2)))
    

    Please note that you need core.matrix library, which is not standard (despite its name). One can install it with Leiningen this way:

    [net.mikera/core.matrix "0.29.1"]
    

    The library lives in namespace clojure.core.matrix. To use this solution 'as is' you should 'add' symbols from the namespace into your namespace.