
Clojure, replacing vars in let bindings causes performance issue?


Let's say there is a function that receives a JSON string containing about 100KB of data. The string gets converted to a map, and then the function keeps associating new keys into that map, rebinding the same name in the let bindings, like below:

(defn myfn
  [s]                                        ; renamed from str to avoid shadowing clojure.core/str
  (let [j (json/read-str s :key-fn keyword)  ; keyword keys, so [:key1 :subkey1] paths work
        j (assoc-in j [:key1 :subkey1] somedata1)
        j (assoc-in j [:key2 :subkey2] somedata2)
        ....
        j (assoc-in j [:key100 :subkey100] somedata100)]
    ... do something ...))

I know that after all those let bindings, j will have all the new keys added. This is just an example; what I wonder about is what actually happens inside all those rebindings of the same name.

I mean, what happens in memory? Would that copy the 100KB 100 times, eating up 100KB * 100 = 10,000KB until execution leaves the function? Or is Clojure smart enough to keep adding new keys in the same memory space?

If you could also recommend where in the Clojure reference I should look to find the answer, that would be really nice.


Solution

  • Clojure uses a data structure called a trie, which is a tree that stores data only at the leaf nodes. Most of Clojure's persistent structures are implemented as tries.

    This excellent article explains things in detail using vectors, so I won't rehash it here. I know that on S.O. it's preferred to give the content rather than a link, but this topic can't be covered fully in an answer, and the article does it best, so I'll just link to it.

    In short, when a data structure is modified in some way, a new trie is created for the new "version"; instead of copying all the data from the old version into the new one with a single change applied, the nodes in the new structure point to the existing data. Here is a visualization from the above article that shows data sharing:
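You can see this sharing directly at the REPL: an assoc-in that touches one branch of a map leaves every other branch as the very same object. (The map below is my own illustration, not from the article.)

```clojure
(def j  {:key1 {:subkey1 0}
         :big  (vec (range 10000))})   ; a large branch we never touch

;; "modify" one branch; everything else is structurally shared
(def j2 (assoc-in j [:key1 :subkey1] 42))

(get-in j2 [:key1 :subkey1])     ;=> 42
(get-in j  [:key1 :subkey1])     ;=> 0, the original is unchanged
(identical? (:big j) (:big j2))  ;=> true, the 10000-element vector was not copied
```

So the 100 rebindings in the question allocate only small path-copies along the trie's spine, not 100 full copies of the 100KB map.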

    [Image: two trie versions sharing nodes — taken from http://hypirion.com/musings/understanding-persistent-vector-pt-1]

    So, using this structure, we have shared data, but since it is only a binary trie, it gets deep very quickly, and lookups become slow (for a vector of 1 billion elements, the depth to reach a leaf node is log2(10^9) ≈ 30). To get around this, Clojure uses a 32-way branching factor instead of a 2-way one, yielding very shallow trees. The same vector of 1 billion elements in Clojure takes only log32(10^9) ≈ 6 levels of indirection to reach the leaves.
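The depth arithmetic is easy to check yourself (trie-depth is just a helper name I made up for this sketch):

```clojure
;; depth of a trie with branching factor b holding n elements: ceil(log_b n)
(defn trie-depth [n b]
  (long (Math/ceil (/ (Math/log n) (Math/log b)))))

(trie-depth 1e9 2)   ;=> 30  binary trie
(trie-depth 1e9 32)  ;=> 6   Clojure's 32-way trie
```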

    I encourage you to read the article, and also have a look at PersistentHashMap, where you will see references to shift + 5 in several places. This is a clever use of bit-shifting to index into the trie: log2(32) = 5, so each level of the trie consumes 5 bits of the index. See the second part of the article for more in-depth info on this.
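A sketch of what that shift-by-5 indexing looks like (level-indices is a hypothetical helper for illustration, not PersistentHashMap's actual code): each level takes the next 5 bits of the index, masked with 0x1f (31).

```clojure
;; split an index into per-level slots for a 32-way trie, top level first
(defn level-indices [i depth]
  (for [shift (range (* 5 (dec depth)) -1 -5)]
    (bit-and (bit-shift-right i shift) 0x1f)))

(level-indices 1000 2)   ;=> (31 8),       since 1000  = 31*32 + 8
(level-indices 70000 4)  ;=> (2 4 11 16),  since 70000 = 2*32768 + 4*1024 + 11*32 + 16
```

These per-level slots are exactly the base-32 digits of the index, which is why a 32-way trie lookup is just a handful of shifts and masks.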

    To summarize, Clojure achieves persistence through these efficient data structures, and any language with immutability at its core must do the same if it hopes to offer usable performance.