Search code examples
dataframejuliadata-manipulationmissing-data

Replace missing values by a group mean


In an iris dataframe I replaced some values by missing. I would like to replace the missing values in the column petal_length by the petal_length mean per species. The code below does work (means before and after replacing values are equal), however I suspect there must be a more efficient way to do this which does not loop through every row while only in some rows values are missing. Also, creating a dictionary is probably not necessary in a more optimised solution. Any suggestions for optimising?

using CSV
using DataFrames
using Random
using Statistics
using StatsBase

download("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/iris.csv", "iris.csv")
iris = DataFrame(CSV.File("iris.csv", delim = ","))
allowmissing!(iris)

Random.seed!(20_000)
for i in 1:100
    iris[rand(1:nrow(iris)), rand(1:4)] = missing
end

Random.seed!(20_000)
iris[sample(1:nrow(iris), 10), :species] .= missing

mean_per_species = combine(groupby(iris, :species), :petal_length .=> mean∘skipmissing .=> :mean)
mean_per_species_dict = Dict(mean_per_species.species .=> mean_per_species.mean)

for row in eachrow(iris)
    if ismissing(row.petal_length)
        row.petal_length = mean_per_species_dict[row.species]
    end
end

Solution

  • Is this what you want (using DataFramesMeta.jl):

    julia> @chain iris begin
               groupby(:species)
               @transform(:petal_length = coalesce.(:petal_length, mean(skipmissing(:petal_length))))
           end
    

    (an in-place variant would use @transform!)