Julia Groupby with mean calculation


I have this dataframe:

using DataFrames, Statistics  # Statistics provides mean

d = DataFrame(class=["A","A","A","B","C","D","D","D"],
              num=[10,20,30,40,20,20,13,12],
              last=[3,5,7,9,11,13,100,12])

and I want to group it by class and take the mean of the other columns. In Python (pandas) I would do:

d.groupby('class')[['num','last']].mean()

How can I do the same in Julia?

I have been trying to use combine and groupby, but with no success so far.

Update: I managed to do it this way:

gd = groupby(d, :class)
combine(gd, :num => mean, :last => mean)

Is there any better way to do it?


Solution

  • It depends on what you mean by "a better way". You can apply the same function to multiple columns like this:

    combine(gd, [:num, :last] .=> mean)
    

    or, if you had a lot of columns and wanted, for example, to apply mean to all columns except the grouping column, you could do:

    combine(gd, Not(:class) .=> mean)
    

    or (if you want to avoid having to remember which column was the grouping column)

    combine(gd, valuecols(gd) .=> mean)
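
    As a quick check, here is a minimal self-contained sketch (it rebuilds d and gd from the question) suggesting that the three forms above select the same columns and produce the same table:

    ```julia
    using DataFrames, Statistics

    d = DataFrame(class=["A","A","A","B","C","D","D","D"],
                  num=[10,20,30,40,20,20,13,12],
                  last=[3,5,7,9,11,13,100,12])
    gd = groupby(d, :class)

    # valuecols lists the non-grouping columns of the grouped data frame
    vc = valuecols(gd)

    r1 = combine(gd, [:num, :last] .=> mean)   # explicit column list
    r2 = combine(gd, Not(:class) .=> mean)     # everything except :class
    r3 = combine(gd, vc .=> mean)              # everything except the grouping columns

    same = (r1 == r2 == r3)
    ```

    The Not(:class) form relies on broadcasting over column selectors, which needs a reasonably recent DataFrames.jl release; the other two forms only broadcast over a plain vector of symbols.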
    

    These are the basic patterns. The other issue is how the target columns are named. By default they get a name of the form "source_function", like this:

    julia> combine(gd, [:num, :last] .=> mean)
    4×3 DataFrame
     Row │ class   num_mean  last_mean
         │ String  Float64   Float64
    ─────┼─────────────────────────────
       1 │ A           20.0     5.0
       2 │ B           40.0     9.0
       3 │ C           20.0    11.0
       4 │ D           15.0    41.6667
    

    You can keep the original column names like this (this is sometimes preferred):

    julia> combine(gd, [:num, :last] .=> mean, renamecols=false)
    4×3 DataFrame
     Row │ class   num      last
         │ String  Float64  Float64
    ─────┼──────────────────────────
       1 │ A          20.0   5.0
       2 │ B          40.0   9.0
       3 │ C          20.0  11.0
       4 │ D          15.0  41.6667
    

    or like this:

    julia> combine(gd, [:num, :last] .=> mean .=> identity)
    4×3 DataFrame
     Row │ class   num      last
         │ String  Float64  Float64
    ─────┼──────────────────────────
       1 │ A          20.0   5.0
       2 │ B          40.0   9.0
       3 │ C          20.0  11.0
       4 │ D          15.0  41.6667
    

    The last example shows that the last part can be any function that takes the source column name as a string and returns the target column name, so you can do:

    julia> combine(gd, [:num, :last] .=> mean .=> col -> "prefix_" * uppercase(col) * "_suffix")
    4×3 DataFrame
     Row │ class   prefix_NUM_suffix  prefix_LAST_suffix
         │ String  Float64            Float64
    ─────┼───────────────────────────────────────────────
       1 │ A                    20.0              5.0
       2 │ B                    40.0              9.0
       3 │ C                    20.0             11.0
       4 │ D                    15.0             41.6667
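
    Besides a function, the last part can also be an explicit target name. A minimal sketch, where avg_num and avg_last are just illustrative names chosen here:

    ```julia
    using DataFrames, Statistics

    d = DataFrame(class=["A","A","A","B","C","D","D","D"],
                  num=[10,20,30,40,20,20,13,12],
                  last=[3,5,7,9,11,13,100,12])
    gd = groupby(d, :class)

    # source => function => target, written out one column at a time
    res1 = combine(gd, :num => mean => :avg_num, :last => mean => :avg_last)

    # the same thing, broadcasting over a vector of target names
    res2 = combine(gd, [:num, :last] .=> mean .=> [:avg_num, :avg_last])

    names(res1)
    ```

    The explicit form is more typing than a naming function, but it is probably the easiest to read when there are only a few columns.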
    

    Edit

    To do the operation in a single line, you can just write:

    combine(groupby(d, :class), [:num, :last] .=> mean)
    

    The benefit of storing groupby(d, :class) in a variable is that the grouping is performed once and the resulting object can then be reused many times, which speeds things up.
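
    For instance, a small sketch of reusing one grouped object for several summaries (the variable names means, counts and maxima are only illustrative):

    ```julia
    using DataFrames, Statistics

    d = DataFrame(class=["A","A","A","B","C","D","D","D"],
                  num=[10,20,30,40,20,20,13,12],
                  last=[3,5,7,9,11,13,100,12])

    # group once ...
    gd = groupby(d, :class)

    # ... then run several aggregations on the same grouped object
    means  = combine(gd, [:num, :last] .=> mean)
    counts = combine(gd, nrow)                       # rows per group, in a column named nrow
    maxima = combine(gd, [:num, :last] .=> maximum)
    ```

    On a data frame this small the saving is negligible; it should matter more when the grouping itself is expensive.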

    Also, if you use DataFramesMeta.jl (which re-exports the @chain macro from Chain.jl), you could write e.g.:

    @chain d begin
        groupby(:class)
        combine([:num, :last] .=> mean)
    end
    

    which is more typing, but this is a style that people coming from R tend to like.
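
    If you would rather not add a package for this, a similar pipeline can be sketched with Base's |> operator and anonymous functions. This is an alternative to @chain, not what the macro expands to:

    ```julia
    using DataFrames, Statistics

    d = DataFrame(class=["A","A","A","B","C","D","D","D"],
                  num=[10,20,30,40,20,20,13,12],
                  last=[3,5,7,9,11,13,100,12])

    # each step receives the result of the previous one as its argument
    res = d |>
        (x -> groupby(x, :class)) |>
        (g -> combine(g, [:num, :last] .=> mean))
    ```

    The parentheses around each anonymous function are needed because -> binds more loosely than |>.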