Search code examples
dataframematlabsumjuliaaccumarray

Efficiently Summing Values by Group in Julia similar to MATLAB's accumarray


I am translating some code from MATLAB to Julia. For context, I am relatively new to Julia and relatively seasoned in MATLAB. I have been struggling with the following specific operation that I do multiple times in my code. I need to sum the values of a variable by some pre-fixed ids. To make it very concrete think of the following operation:

│ ids   │ values │                      │ ids   | sum_values |
┼───────┼────────┤                      ┼───────┼────────────┤
│ 2     │ 1948.6 │      converted       │ 1     │ 3995.4     │
│ 1     │ 1994.7 │          to          │ 2     │ 1948.6     │
│ 3     │ 1940.1 │       ======>        │ 3     │ 3844.4     │
│ 1     │ 2000.7 │                      │ 4     │ 1982.0     │
│ 4     │ 1982.0 │                      
│ 3     │ 1904.3 │                      

But in reality think of these two as large arrays, ids and values (in Julia speak, they are Int64 and Float64). Each of these two arrays in my case have about 4 million observations. Note that ids is not necessarily sorted.

In MATLAB code, they look like the following:

rng(9491)

n_obs = 4*10^6; # number of observations
n_ids = 3*10^5; # number of unique ids

values = rand(n_obs,1);
ids = randsample(n_ids, n_obs,'true');

and in Julia:

using Random, StatsBase

Random.seed!(9491)

n_obs = 4*10^6  # number of observations
n_ids = 3*10^5  # number of unique ids

values = rand(n_obs);
ids = sample(1:n_ids, n_obs, replace=true);

So ids groups observations while values are just some values computed in an algorithm. In MATLAB, what I do to compute sum_values is simply to use:

sum_values = accumarray(ids,values);

and I have been having a hard time figuring out what is an extremely time efficient implementation of the same operation. For context, I do about 200,000 calls of the function in different ways (that is, with around 20 different ids, and thousands of different values) in my optimization routine.

I stumbled upon this 7-year old answer to a similar question that uses DataFrames but I don't know if this would fly in my setting where the key is to do it fast and then use the sum_values in other matrix operations. Note that these are large arrays, so putting them in a DataFrame, then using groupby then taking the values out, converting them into a Vector{Float64} may actually not be a great idea computationally (happy to be proven wrong). Using dictionaries sounded like a good idea but then it involves sorting (for some reason by dictionaries end up not sorted in ids).

What is the most efficient way to sum values by group in Julia? Any advice on how to implement this, or something similar, would be greatly appreciated.


Solution

  • The optimal answer probably depends on whether you expect most groups to have any associated values. If there are values in most of the groups, then making an output vector is probably best, but if few groups have any values, then you can consider something like a Dict.

    Anyway, here's a solution assuming a high 'fill grade'.

    # provide an output vector, out, to write into. If you already know how long out should be beforehand, you can use that
    
    function accumarray!(out, ids, val)
        for i in eachindex(ids, val)
            out[ids[i]] += val[i]
        end
        return out
    end
    
    # Calculate length of output vector, pre-allocate and call
    accumarray(ids, val) = accumarray!(zeros(eltype(val), maximum(ids)), ids, val)
    

    This should be pretty efficient, but it's not completely trivial to parallelize.