I'm working with Ruby and wrote the following code, using the Parallel gem on JRuby 1.7.19, to speed up the creation of a hash from an array with many values:
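require 'parallel'

# Defined before the parallel block so the method exists when it is called.
def keep_item?(item)
  item["value"] > 0
end

hash = {}
array = [
  {"id" => "A001", "value" => 1},
  {"id" => "B002", "value" => 0},
  {"id" => "C003", "value" => 3},
  {"id" => "D004", "value" => 0}]

# Each worker thread writes to the same shared hash without synchronization.
Parallel.each(array, { in_threads: 5 }) do |item|
  if keep_item?(item)
    hash[item["id"]] = item
  end
end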
It was brought to my attention that there could be issues with adding keys to a hash in parallel in Ruby. Are there any risks with this code (thread safety, loss of data, strange locks I'm unaware of, etc.) such that I should have just left it as a regular serial #each call?
Hash isn't thread-safe. If keep_item? reads the hash, there will be a race condition. Even if it doesn't, there are still concurrent updates to the hash, which is error-prone.
Without a lock or other synchronization, there is theoretically no guarantee that updates made to a non-thread-safe hash on one thread are visible on other threads. Concurrent updates to the hash without synchronization may lose data or cause other strange issues. The exact behavior depends on the implementation of the Ruby Hash.
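As a minimal sketch of what synchronized access could look like, the version below guards every write to the shared hash with a Mutex. It uses plain Ruby threads rather than the Parallel gem so that it runs with only the standard library. The method name build_hash_with_mutex and the slicing strategy are assumptions made for illustration, not part of the original code.

```ruby
# Same filter as in the question.
def keep_item?(item)
  item["value"] > 0
end

# Hypothetical helper: builds the filtered hash from several threads,
# serializing every write to the shared hash through a Mutex.
def build_hash_with_mutex(items, thread_count = 5)
  hash = {}
  lock = Mutex.new
  # Split the input into at most thread_count slices, one per thread.
  slice_size = [(items.size.to_f / thread_count).ceil, 1].max

  threads = items.each_slice(slice_size).map do |slice|
    Thread.new do
      slice.each do |item|
        next unless keep_item?(item)
        # Only one thread at a time may modify the hash.
        lock.synchronize { hash[item["id"]] = item }
      end
    end
  end

  threads.each(&:join) # wait for all workers before reading the result
  hash
end
```

Calling build_hash_with_mutex(array) with the array from the question should yield a hash containing only the A001 and C003 entries.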
Your data is simple enough that you should just process it with a normal each. If you use Parallel and add a mutex or lock for thread-safe access, the synchronization overhead will add significant time to the overall process, and the safe parallel version will likely take longer than the serial one.
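For comparison, here is one way the serial version might look. The method name build_hash_sequentially is made up for this sketch; each_with_object is used only to avoid a separate hash variable, and a plain each with an outer hash would work just as well.

```ruby
# Same filter as in the question.
def keep_item?(item)
  item["value"] > 0
end

# Hypothetical helper: builds the filtered hash on a single thread,
# so no locking is needed.
def build_hash_sequentially(items)
  items.each_with_object({}) do |item, hash|
    hash[item["id"]] = item if keep_item?(item)
  end
end
```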
Parallel is useful when your task is I/O-bound, or CPU-bound as long as you have free cores and the tasks don't need to exchange data with each other.
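To illustrate tasks that don't exchange data, the sketch below has each thread build its own private hash and return it, and the partial results are merged on the calling thread afterwards. No lock is required because no object is written by more than one thread. It uses plain threads and Thread#value instead of the Parallel gem, and the method name build_hash_without_shared_state is an assumption for this example.

```ruby
# Same filter as in the question.
def keep_item?(item)
  item["value"] > 0
end

# Hypothetical helper: every thread fills a private hash; the partial
# hashes are merged on the calling thread once the workers finish.
def build_hash_without_shared_state(items, thread_count = 5)
  slice_size = [(items.size.to_f / thread_count).ceil, 1].max

  threads = items.each_slice(slice_size).map do |slice|
    Thread.new do
      # The block's return value becomes the thread's value.
      slice.each_with_object({}) do |item, partial|
        partial[item["id"]] = item if keep_item?(item)
      end
    end
  end

  # Thread#value waits for the thread and returns its result.
  threads.map(&:value).reduce({}) { |merged, partial| merged.merge(partial) }
end
```

For work as cheap as the filter in the question, the cost of creating threads and merging results will probably outweigh any gain, so this pattern only pays off when the per-item work is substantial.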