I am translating some code from MATLAB to Julia. For context, I am relatively new to Julia and relatively seasoned in MATLAB. I have been struggling with the following specific operation that I do multiple times in my code. I need to sum the values of a variable by some pre-fixed ids. To make it very concrete think of the following operation:
│ ids │ values │ │ ids | sum_values |
┼───────┼────────┤ ┼───────┼────────────┤
│ 2 │ 1948.6 │ converted │ 1 │ 3995.4 │
│ 1 │ 1994.7 │ to │ 2 │ 1948.6 │
│ 3 │ 1940.1 │ ======> │ 3 │ 3844.4 │
│ 1 │ 2000.7 │ │ 4 │ 1982.0 │
│ 4 │ 1982.0 │
│ 3 │ 1904.3 │
But in reality think of these two as large arrays, ids
and values
(in Julia speak, they are Int64
and Float64
). Each of these two arrays in my case have about 4 million observations. Note that ids
is not necessarily sorted.
In MATLAB code, they look like the following:
rng(9491)
n_obs = 4*10^6; # number of observations
n_ids = 3*10^5; # number of unique ids
values = rand(n_obs,1);
ids = randsample(n_ids, n_obs,'true');
and in Julia:
using Random, StatsBase
Random.seed!(9491)
n_obs = 4*10^6 # number of observations
n_ids = 3*10^5 # number of unique ids
values = rand(n_obs);
ids = sample(1:n_ids, n_obs, replace=true);
So ids
groups observations while values
are just some values computed in an algorithm. In MATLAB, what I do to compute sum_values
is simply to use:
sum_values = accumarray(ids,values);
and I have been having a hard time figuring out what is an extremely time efficient implementation of the same operation. For context, I do about 200,000 calls of the function in different ways (that is, with around 20 different ids
, and thousands of different values
) in my optimization routine.
I stumbled upon this 7-year old answer to a similar question that uses DataFrames
but I don't know if this would fly in my setting where the key is to do it fast and then use the sum_values
in other matrix operations. Note that these are large arrays, so putting them in a DataFrame, then using groupby
then taking the values out, converting them into a Vector{Float64}
may actually not be a great idea computationally (happy to be proven wrong). Using dictionaries sounded like a good idea but then it involves sorting (for some reason by dictionaries end up not sorted in ids).
What is the most efficient way to sum values by group in Julia? Any advice on how to implement this, or something similar, would be greatly appreciated.
The optimal answer probably depends on whether you expect most groups to have any associated values. If there are values in most of the groups, then making an output vector is probably best, but if few groups have any values, then you can consider something like a Dict
.
Anyway, here's a solution assuming a high 'fill grade'.
# provide an output vector, out, to write into. If you already know how long out should be beforehand, you can use that
function accumarray!(out, ids, val)
for i in eachindex(ids, val)
out[ids[i]] += val[i]
end
return out
end
# Calculate length of output vector, pre-allocate and call
accumarray(ids, val) = accumarray!(zeros(eltype(val), maximum(ids)), ids, val)
This should be pretty efficient, but it's not completely trivial to parallelize.