This is a simple attempt to reproduce some code that Ross Ihaka gave as an example of poor R performance. I was curious as to whether Clojure's persistent data structures would offer any improvement. (https://www.stat.auckland.ac.nz/~ihaka/downloads/JSM-2010.pdf)
However, I'm not even getting to first base: I get a StackOverflowError and not much else to go by. Any ideas? Apologies in advance if the question has an obvious answer I've missed...
; Testing Ross Ihaka's example of poor R performance
; against Clojure, to see if persistent data structures help
(def dd (repeat 60000 '(0 0 0 0)))
(defn repl-row [d i new-r]
  (concat (take (dec i) d) (list new-r) (drop i d)))
(defn changerows [d new-r]
  (loop [i 10000
         data d]
    (if (zero? i)
      data
      (let [j (rand-int 60000)
            newdata (repl-row data j new-r)]
        (recur (dec i) newdata)))))
user=> (changerows dd '(1 2 3 4))
StackOverflowError clojure.lang.Numbers.isPos (Numbers.java:96)
Further, if anyone has any ideas how persistent functional data structures can be used to best advantage in the example above, I'd be very keen to hear. The speedup reported from not using immutable structures (link above) was about 500%!
Looking at the stack trace for the StackOverflowError, this seems to be an "exploding thunk" (lazy/suspended calculation) problem that isn't obviously related to the recursion in your example:
java.lang.StackOverflowError
at clojure.lang.RT.seq(RT.java:528)
at clojure.core$seq__5124.invokeStatic(core.clj:137)
at clojure.core$concat$cat__5217$fn__5218.invoke(core.clj:726)
at clojure.lang.LazySeq.sval(LazySeq.java:40)
at clojure.lang.LazySeq.seq(LazySeq.java:49)
at clojure.lang.RT.seq(RT.java:528)
at clojure.core$seq__5124.invokeStatic(core.clj:137)
at clojure.core$take$fn__5630.invoke(core.clj:2876)
Changing this line to realize newdata into a vector resolves the issue:
(recur (dec i) (vec newdata))
This workaround addresses the use of concat in repl-row by forcing concat's lazy sequence to be realized on each step. concat returns lazy sequences, and in your loop/recur you're passing the lazy, unevaluated results of previous concat calls as input to subsequent concat calls, which return more lazy sequences based on the previous, unrealized ones. The final concat-produced lazy sequence isn't realized until the loop finishes, which results in a stack overflow due to its dependence on thousands of previous concat-produced lazy sequences.
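To see the same failure mode in isolation, here's a minimal sketch (not from the original question) that stacks unrealized concat thunks and then forces them:

;; Each concat wraps the previous lazy seq without realizing it,
;; building ~100000 levels of nested thunks.
(def exploding (reduce (fn [s _] (concat s [0])) [] (range 100000)))

;; Realizing even the first element must descend through every
;; nested thunk, overflowing the stack.
(first exploding) ;; => StackOverflowError

Forcing realization on each pass, e.g. with vec as above (or doall, which realizes a sequence without changing its type), keeps the nesting shallow and avoids the overflow.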
Further, if anyone has any ideas how persistent functional data structures can be used to best advantage in the example above, I'd be very keen to hear.
Since it seems the usage of concat here is simply to replace an element in a collection, we can get the same effect by using a vector and assoc-ing the new item into the correct position of the vector:
(def dd (vec (repeat 60000 '(0 0 0 0))))
(defn changerows [d new-r]
  (loop [i 10000
         data d]
    (if (zero? i)
      data
      (let [j (rand-int 60000)
            newdata (assoc data j new-r)]
        (recur (dec i) newdata)))))
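As a quick sanity check at the REPL (a sketch, assuming the defs above), the result keeps its 60000 rows and contains the replacement row:

(let [result (changerows dd '(1 2 3 4))]
  ;; row count, and the first row equal to the replacement
  [(count result) (some #{'(1 2 3 4)} result)])
;; => [60000 (1 2 3 4)]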
Notice there's no more repl-row function; we just assoc into data using the index and the new value. After some rudimentary benchmarking with time, this approach seems to be many times faster:
"Elapsed time: 75836.412166 msecs" ;; original sample w/fixed overflow
"Elapsed time: 2.984481 msecs" ;; using vector+assoc instead of concat
And here's another way to solve it by viewing the series of replacements as an infinite sequence of replacement steps, then sampling from that sequence:
(defn random-replace [replacement coll]
  (assoc coll (rand-int (count coll)) replacement))

(->> (iterate (partial random-replace '(1 2 3 4)) dd)
     (drop 10000) ;; could also use `nth` here
     (first))
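As the comment above hints, nth collapses the drop/first pair into a single step; the 10000th element of the iteration is dd after 10000 random replacements:

(nth (iterate (partial random-replace '(1 2 3 4)) dd) 10000)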