Search code examples
clojurehashmapmarkov

clojure simple markov data transform


If I have a vector of words for example ["john" "said"... "john" "walked"...] and I want to make a hash map of each word and the number of occurrences of next word for example {"john" {"said" 1 "walked" 1 "kicked" 3}}

The best solution I came up with was recursively walking through a list by index and using assoc to keep updating the hash-map but that seems really messy. Is there a more idiomatic way of doing this?


Solution

  • Given you have words:

    (def words ["john" "said" "lara" "chased" "john" "walked" "lara" "chased"])
    

    Use this transformation-fn

    (defn transform
      [words]
      (->> words
           (partition 2 1)
           (reduce (fn [acc [w next-w]]
                     ;; could be shortened to #(update-in %1 %2 (fnil inc 0))
                     (update-in acc
                                [w next-w]
                                (fnil inc 0))) 
                   {})))
    
    (transform words)
    ;; {"walked" {"lara" 1}, "chased" {"john" 1}, "lara" {"chased" 2}, "said" {"lara" 1}, "john" {"walked" 1, "said" 1}}
    

    EDIT: You can gain performance using transient hash-maps like this:

    (defn transform-fast
      [words]
      (->> (map vector words (next words))
           (reduce (fn [acc [w1 w2]]
                     (let [c-map (get acc w1 (transient {}))]
                       (assoc! acc w1 (assoc! c-map w2
                                              (inc (get c-map w2 0))))))
                   (transient {}))
           persistent!
           (reduce-kv (fn [acc w1 c-map]
                        (assoc! acc w1 (persistent! c-map)))
                      (transient {}))
           persistent!))
    

    Obviously the resulting source code doesn't look as nice and such optimization should only happen if it is critical.

    (Criterium says it beats Michał Marczyks transform* being roughly two times as fast on King Lear).