Search code examples
dataframejuliameanmissing-data

Replace missing values with mean in each column in dataframe Julia


I would like to impute the missing values with their mean value per column in a dataframe. Here is some sample data:

using DataFrames
using Statistics
df = DataFrame(x = [1, missing, 3], y = [missing, 2, 3], z = [4, 4,  missing])

3×3 DataFrame
 Row │ x        y        z       
     │ Int64?   Int64?   Int64?  
─────┼───────────────────────────
   1 │       1  missing        4
   2 │ missing        2        4
   3 │       3        3  missing 

I know you could use replace! with mean and skipmissing to impute a single column like this:

replace!(df.x, missing => mean(skipmissing(df.x)))

3-element Vector{Union{Missing, Int64}}:
 1
 2
 3

3×3 DataFrame
 Row │ x       y        z       
     │ Int64?  Int64?   Int64?  
─────┼──────────────────────────
   1 │      1  missing        4
   2 │      2        2        4
   3 │      3        3  missing

But using eachcol to impute each column like this doesn't work:

replace!(eachcol(df), missing => mean(skipmissing(eachcol(df))))

MethodError: no method matching _replace!(::Base.var"#new#352"{Tuple{Pair{Missing, Vector{Missing}}}}, ::DataFrames.DataFrameColumns{DataFrame}, ::DataFrames.DataFrameColumns{DataFrame}, ::Int64)

So I was wondering if anyone knows how to replace the missing values in each column with their mean value in Julia?


Solution

  • So, replace! is incorrect. To see why note that:

    julia> x = [1, missing, 2]
    3-element Vector{Union{Missing, Int64}}:
     1
      missing
     2
    
    julia> replace!(x, missing => mean(skipmissing(x)))
    ERROR: InexactError: Int64(1.5)
    

    You need the following (assuming you want an in-place operation):

    julia> transform!(df, All() .=> (x -> replace(x, missing => mean(skipmissing(x)))) => identity)
    3×3 DataFrame
     Row │ x        y        z
         │ Float64  Float64  Float64
    ─────┼───────────────────────────
       1 │     1.0      2.5      4.0
       2 │     2.0      2.0      4.0
       3 │     3.0      3.0      4.0
    

    Note the promotion of column eltypes.

    This is a general approach. An alternative that is valid in your case as you want to replace all columns would be:

    julia> mapcols!(df) do x
               replace(x, missing => mean(skipmissing(x)))
           end
    3×3 DataFrame
     Row │ x        y        z
         │ Float64  Float64  Float64
    ─────┼───────────────────────────
       1 │     1.0      2.5      4.0
       2 │     2.0      2.0      4.0
       3 │     3.0      3.0      4.0
    

    Yet another way to do it would be:

    julia> df[!, :] .= coalesce.(df, permutedims((mean∘skipmissing).(eachcol(df))))
    3×3 DataFrame
     Row │ x      y        z
         │ Int64  Float64  Float64
    ─────┼─────────────────────────
       1 │     1      2.5      4.0
       2 │     2      2.0      4.0
       3 │     3      3.0      4.0
    

    Note a difference. Broadcasting does not change eltype of column :x as it is not needed because the mean happens to be integer.