ruby multithreading hash parallel-processing jruby

Issues with parallelizing the creation of key-value pairs in a Ruby hash


I'm working with Ruby and wrote the following code, using Parallel and JRuby 1.7.19, to speed up the creation of a hash from an array with many values:

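require 'parallel'

# Defined before it is called so the method exists when the block runs.
def keep_item?(item)
  item["value"] > 0
end

hash = {}
array = [
  {"id" => "A001", "value" => 1},
  {"id" => "B002", "value" => 0},
  {"id" => "C003", "value" => 3},
  {"id" => "D004", "value" => 0}]

# Every thread writes to the same plain Hash, with no synchronization.
Parallel.each(array, { in_threads: 5 }) do |item|
  if keep_item?(item)
    hash[item["id"]] = item
  end
end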

It was brought to my attention that there could be issues with adding keys to a hash in parallel in Ruby. Are there any risks with this code (thread safety, loss of data, strange locks I'm unaware of, etc.) such that I should have just left it as a regular serial #each call?


Solution

  • Hash isn't thread-safe. If keep_item? reads the hash, there will be a race condition. Even if it doesn't, the block still updates the hash concurrently from several threads, which is error-prone.

    Without a lock or other synchronization, there is in theory no guarantee that an update made to a non-thread-safe hash on one thread is visible on another thread. Unsynchronized concurrent updates may lose data or cause other strange issues, depending on how the Ruby implementation in use implements Hash. On JRuby, threads run truly in parallel (there is no global interpreter lock), so this is not only a theoretical concern.

    Your data is simple enough that you should just process it with a normal each. If you use Parallel and add a mutex or lock for thread-safe access, the synchronization overhead will add significant time to the overall process, and the safe parallel version will likely take longer than the serial one.

    Parallel is useful when your tasks are I/O-bound, or CPU-bound as long as you have free cores and the tasks don't need to exchange data with each other.
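
To make the two options concrete, here is a minimal sketch that builds the hash once with a plain each and once from several threads with a Mutex around every write. It uses Ruby's built-in Thread and Mutex rather than the Parallel gem so that it runs without extra dependencies; the names serial_hash, locked_hash and lock, and the slice size of 2, are just illustrative choices, not part of the original code.

```ruby
require 'thread' # Thread and Mutex are core classes; this require is harmless

def keep_item?(item)
  item["value"] > 0
end

array = [
  {"id" => "A001", "value" => 1},
  {"id" => "B002", "value" => 0},
  {"id" => "C003", "value" => 3},
  {"id" => "D004", "value" => 0}
]

# Option 1: plain serial pass. Only one thread ever touches the hash.
serial_hash = {}
array.each do |item|
  serial_hash[item["id"]] = item if keep_item?(item)
end

# Option 2: several threads, with a Mutex guarding every write to the
# shared hash. (Assumption: plain threads stand in for Parallel here.)
locked_hash = {}
lock = Mutex.new
threads = array.each_slice(2).map do |slice|
  Thread.new(slice) do |items|
    items.each do |item|
      next unless keep_item?(item)                         # filter outside the lock
      lock.synchronize { locked_hash[item["id"]] = item }  # write inside the lock
    end
  end
end
threads.each(&:join)

puts serial_hash.keys.sort.inspect
puts locked_hash.keys.sort.inspect
```

Both versions end up with the same two entries (A001 and C003). With a check as cheap as keep_item?, the threaded version mostly pays for thread startup and locking, which is why the serial loop is the better fit for this data.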