Search code examples
multithreadingconcurrencyclojuregarbage-collection

Why does looping over a large amount of data in another thread cause an overactive GC, and prevent some data from being freed?


I'm writing code that takes some lazy results produced by pmap, and draws them onto a BufferedImage. For three days now I've been trying to figure out why the drawing suddenly starts freezing and eventually halts about a 1/3 of the way through.

I've finally narrowed it down to the fact that I'm looping over a large amount of data in another thread.

This is the best MCVE that I've come up with:

(ns mandelbrot-redo.irrelevant.write-image-mcve
  (:import [java.awt.image BufferedImage]
           (java.util.concurrent Executors Executor)))

(defn lazy-producer [width height]
  (for [y (range height)
        x (range width)]
    [x y (+ x y)]))

; This works fine; finishing after about 5 seconds when width=5000
(defn sync-consumer [results width height]
  (time
    (doseq [[i result] (map vector (range) results)]
      (when (zero? (rem i 1e6))
        (println (str (double (/ i (* width height) 0.01)) "%")))

      ((fn boop [x] x) result)))) ; Data gets consumed here

; This gets to ~30%, then begins being interupted by 1-4 second lags
(defn async-consumer [results width height]
  (doto
    (Thread. ^Runnable
             (fn []
               (sync-consumer results width height)
               (println "Done...")))
    (.start)))

(defn -main []
  (let [width 5000
        height (int (* width 2/3))]
    (-> (lazy-producer width height)
        (async-consumer width height))))

When -main is run with sync-consumer, it finishes after a few seconds. With async-consumer however, it gets to about 25%, then begins slowing to a crawl to the point where the last printed percentage is 30%. If I leave it, I get an OOME.

If I use an explicit Thread., or use a local thread pool in async-consumer, it hangs and crashes. If I use future however, it finishes fine, just like sync-consumer.

The only hint I've gotten is that when I run this in VisualVM, I see that I have runaway allocation of Longs when using the async version:

VisualVM Snapshot

The sync version shows a peak amount of Longs to be about 45mb at once in comparison.

The CPU usage is quite different too:

enter image description here

There's massive GC spikes, but it doesn't seem like the Longs are being disposed of.

I could use future for this, but I've been bitten by its exception swallowing behavior so many times, I'm hesitant.

Why is this happening? Why is running this in a new thread causing the GC to go crazy, while at the same time numbers aren't being freed?

Can anyone explain this behavior?


Solution

  • The sync version seems to be processing through the 16M+ results and will not hold onto the head of the results seq due to locals clearing. This means that as you go, values are created, processed, and GC'ed.

    The async one closes over results in the fn and will hold the head, keeping all 16M+ values in memory, likely leading to GC thrashing?

    I actually can't reproduce what you describe - both sync and async take about the same time for me as written above. (Clojure 1.9, Java 1.8).