
Enumerator::Lazy and Garbage Collection


I am using Ruby's built-in CSV parser against large files.

My approach is to separate the parsing from the rest of the logic. To achieve this, I am creating an array of hashes. I also want to take advantage of Ruby's Enumerator::Lazy to avoid loading the entire file into memory.

My question is: when I'm actually iterating through the array of hashes, does the garbage collector clean things up as I go, or will it only clean up once the entire array can be collected, essentially keeping everything in memory anyway?

I'm not asking whether it will clean each element as I finish with it, only whether it will clean anything before the entire enum has been evaluated.


Solution

  • When you iterate over a plain old array, the garbage collector has no chance to do anything, because the array itself still holds a reference to every element. You can help the garbage collector by writing nil into an array position once you no longer need that element, so that the object at that position becomes eligible for collection (see the short sketch at the end of this answer).

    When you use a lazy enumerator correctly, you are not iterating over an array of hashes. Instead, you enumerate the hashes one after the other, and each one is read on demand.

    So you have the chance to use much less memory (depending on your further processing, and provided it does not keep the objects in memory anyway).

    The structure may look like this:

    require 'csv'

    enum = Enumerator.new do |yielder|
      # CSV.foreach streams the file row by row instead of loading it all at once
      # (path is assumed to hold the CSV file's path)
      CSV.foreach(path, headers: true) do |row|
        yielder.yield row.to_h
      end
    end

    enum.lazy.map { |hash| do_something(hash); nil }.count
    

    You also need to make sure that you do not generate the array again in the last step of the chain.
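
    For instance, ending the lazy chain with force or to_a would collect every result into a new array, while a terminal call such as count or each consumes the elements one by one (a sketch continuing the example above):

    # Rebuilds a full result array and defeats the purpose:
    enum.lazy.map { |hash| do_something(hash) }.to_a

    # Consumes the elements one at a time without collecting them:
    enum.lazy.map { |hash| do_something(hash); nil }.count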
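
    As for the plain-array case mentioned at the top of this answer, here is a minimal sketch of the nil trick (assuming rows is an array of hashes that is already loaded in memory):

    rows.each_with_index do |hash, i|
      do_something(hash)
      rows[i] = nil  # drop the reference so this hash becomes eligible for collection
    end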