Search code examples
dataframejuliagroupingcombinations

Share out a value on multiple entries with DataFrames in Julia


I'm new with the Julia programming language and I would like to group scores by country in a DataFrame like :

 Row │ Name       Score    Country       
     │ String15   Float64  String15   
─────┼────────────────────────────────
   1 │ Oliver         5.0  France
   2 │ Patrick        3.0  Spain
   3 │ Jules          2.0  France
   4 │ Steven         3.5  USA
   5 │ Karl           4.0  France
   6 │ Alexander      3.0  France/USA
   7 │ Julian         1.0  Spain/USA

I have grouped my data by Country with

combine(groupby(db_test, :Country), :Score=>sum)

and I get :

 Row │ Country     Score_sum 
     │ String15    Float64   
─────┼───────────────────────
   1 │ France           11.0
   2 │ Spain             3.0
   3 │ USA               3.5
   4 │ France/USA        3.0
   5 │ Spain/USA         1.0

But I would like to share the score of France/USA and Spain/USA to France, Spain and USA to obtain this :

 Row │ Country     Score_sum 
     │ String15    Float64   
─────┼───────────────────────
   1 │ France           12.5
   2 │ Spain             3.5
   3 │ USA               5.5

How can I share the value of a column according to the number and the combination of countries in another column ?


Solution

  • Here is a full code doing this. I do it step-by-step to make it easy to understand what is going on:

    julia> using CSV, DataFrames
    
    julia> data = """score,country
           5.0,France
           3.0,Spain
           2.0,France
           3.5,USA
           4.0,France
           3.0,France/USA
           1.0,Spain/USA"""
    "score,country\n5.0,France\n3.0,Spain\n2.0,France\n3.5,USA\n4.0,France\n3.0,France/USA\n1.0,Spain/USA"
    
    julia> df = CSV.read(IOBuffer(data), DataFrame)
    7×2 DataFrame
     Row │ score    country    
         │ Float64  String15   
    ─────┼─────────────────────
       1 │     5.0  France
       2 │     3.0  Spain
       3 │     2.0  France
       4 │     3.5  USA
       5 │     4.0  France
       6 │     3.0  France/USA
       7 │     1.0  Spain/USA
    
    julia> df.countrys = split.(df.country, "/")
    7-element Vector{Vector{SubString{String15}}}:
     ["France"]
     ["Spain"]
     ["France"]
     ["USA"]
     ["France"]
     ["France", "USA"]
     ["Spain", "USA"]
    
    julia> df.scores = df.score ./ length.(df.countrys)
    7-element Vector{Float64}:
     5.0
     3.0
     2.0
     3.5
     4.0
     1.5
     0.5
    
    julia> df2 = flatten(df, :countrys)
    9×4 DataFrame
     Row │ score    country     countrys   scores  
         │ Float64  String15    SubStrin…  Float64 
    ─────┼─────────────────────────────────────────
       1 │     5.0  France      France         5.0
       2 │     3.0  Spain       Spain          3.0
       3 │     2.0  France      France         2.0
       4 │     3.5  USA         USA            3.5
       5 │     4.0  France      France         4.0
       6 │     3.0  France/USA  France         1.5
       7 │     3.0  France/USA  USA            1.5
       8 │     1.0  Spain/USA   Spain          0.5
       9 │     1.0  Spain/USA   USA            0.5
    
    julia> combine(groupby(df2, :countrys), :scores=>sum)
    3×2 DataFrame
     Row │ countrys   scores_sum 
         │ SubStrin…  Float64
    ─────┼───────────────────────
       1 │ France           12.5
       2 │ Spain             3.5
       3 │ USA               5.5