clojure simple markov data transform

If I have a vector of words for example ["john" "said"... "john" "walked"...] and I want to make a hash map of each word and the number of occurrences of next word for example {"john" {"said" 1 "walked" 1 "kicked" 3}}

The best solution I came up with was recursively walking through a list by index and using assoc to keep updating the hash-map but that seems really messy. Is there a more idiomatic way of doing this?

Solution

Given you have words:

(def words ["john" "said" "lara" "chased" "john" "walked" "lara" "chased"])

Use this transformation-fn

(defn transform
  [words]
  (->> words
       (partition 2 1)
       (reduce (fn [acc [w next-w]]
                 ;; could be shortened to #(update-in %1 %2 (fnil inc 0))
                 (update-in acc
                            [w next-w]
                            (fnil inc 0))) 
               {})))

(transform words)
;; {"walked" {"lara" 1}, "chased" {"john" 1}, "lara" {"chased" 2}, "said" {"lara" 1}, "john" {"walked" 1, "said" 1}}

EDIT: You can gain performance using transient hash-maps like this:

(defn transform-fast
  [words]
  (->> (map vector words (next words))
       (reduce (fn [acc [w1 w2]]
                 (let [c-map (get acc w1 (transient {}))]
                   (assoc! acc w1 (assoc! c-map w2
                                          (inc (get c-map w2 0))))))
               (transient {}))
       persistent!
       (reduce-kv (fn [acc w1 c-map]
                    (assoc! acc w1 (persistent! c-map)))
                  (transient {}))
       persistent!))

Obviously the resulting source code doesn't look as nice and such optimization should only happen if it is critical.

(Criterium says it beats Michał Marczyks transform* being roughly two times as fast on King Lear).