I would like to impute the missing values with their mean value per column in a dataframe. Here is some sample data:
using DataFrames
using Statistics
df = DataFrame(x = [1, missing, 3], y = [missing, 2, 3], z = [4, 4, missing])
3×3 DataFrame
Row │ x y z
│ Int64? Int64? Int64?
─────┼───────────────────────────
1 │ 1 missing 4
2 │ missing 2 4
3 │ 3 3 missing
I know you could use replace!
with mean and skipmissing
to impute a single column like this:
replace!(df.x, missing => mean(skipmissing(df.x)))
3-element Vector{Union{Missing, Int64}}:
1
2
3
3×3 DataFrame
Row │ x y z
│ Int64? Int64? Int64?
─────┼──────────────────────────
1 │ 1 missing 4
2 │ 2 2 4
3 │ 3 3 missing
But using eachcol
to impute each column like this doesn't work:
replace!(eachcol(df), missing => mean(skipmissing(eachcol(df))))
MethodError: no method matching _replace!(::Base.var"#new#352"{Tuple{Pair{Missing, Vector{Missing}}}}, ::DataFrames.DataFrameColumns{DataFrame}, ::DataFrames.DataFrameColumns{DataFrame}, ::Int64)
So I was wondering if anyone knows how to replace the missing values in each column with their mean value in Julia
?
So, replace!
is incorrect. To see why note that:
julia> x = [1, missing, 2]
3-element Vector{Union{Missing, Int64}}:
1
missing
2
julia> replace!(x, missing => mean(skipmissing(x)))
ERROR: InexactError: Int64(1.5)
You need the following (assuming you want an in-place operation):
julia> transform!(df, All() .=> (x -> replace(x, missing => mean(skipmissing(x)))) => identity)
3×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 1.0 2.5 4.0
2 │ 2.0 2.0 4.0
3 │ 3.0 3.0 4.0
Note the promotion of column eltypes.
This is a general approach. An alternative that is valid in your case as you want to replace all columns would be:
julia> mapcols!(df) do x
replace(x, missing => mean(skipmissing(x)))
end
3×3 DataFrame
Row │ x y z
│ Float64 Float64 Float64
─────┼───────────────────────────
1 │ 1.0 2.5 4.0
2 │ 2.0 2.0 4.0
3 │ 3.0 3.0 4.0
Yet another way to do it would be:
julia> df[!, :] .= coalesce.(df, permutedims((mean∘skipmissing).(eachcol(df))))
3×3 DataFrame
Row │ x y z
│ Int64 Float64 Float64
─────┼─────────────────────────
1 │ 1 2.5 4.0
2 │ 2 2.0 4.0
3 │ 3 3.0 4.0
Note a difference. Broadcasting does not change eltype of column :x
as it is not needed because the mean happens to be integer.