I was trying to fill NA values from column 'Income' with the mean of the column based on the gender column, so if 'M' has a NA value, insert one from the mean of all 'M' people (except NA values):
GENDER|INCOME
M| 1
M| 2
M| 3
M| NA
F| 4
F| 4
F| 5
F| NA
GENDER|INCOME
M| 1
M| 2
M| 3
M| 2
F| 4
F| 4
F| 5
F| 4.33
I tried the aggregate method but it gives you a list that, if I try to input in my dataset, it gives me an error in order to write a CSV. It works perfectly to get the values, but I don't know how to input these values but manually. This is my code so far:
df$INCOME[is.na(df$INCOME)] <- aggregate(.~df$GENDER, df, FUN = mean, na.rm = TRUE)
Thank you for any help!
If we want the mean
to fill the missing values (NA
) for each 'GENDER', then use na.aggregate
from zoo
which does this automatically (FUN
by default is mean
)
library(zoo)
library(dplyr)
df <- df %>%
group_by(GENDER) %>%
mutate(INCOME = na.aggregate(INCOME)) %>%
ungroup
Or another option is to replace
the NA
elements with the mean
df %>%
group_by(GENDER) %>%
mutate(INCOME = replace(INCOME, is.na(INCOME),
mean(INCOME, na.rm = TRUE))) %>%
ungroup
aggregate
returns a data.frame
of summarised value columns and the group column. Thus, assigning on the NA
elements with the whole data.frame may not work i.e. we need to extract the column
df_mean <- aggregate(INCOME ~ GENDER, df, FUN = mean,
na.rm = TRUE, na.action = NULL)
df_mean
GENDER INCOME
1 F 4.333333
2 M 2.000000
df_mean$INCOME
[1] 4.333333 2.000000
df$INCOME[is.na(df$INCOME)] <- df_mean$INCOME
Extracting the INCOME
column and assign could be buggy approach because
NA
for each 'GENDER' could be potentially differentTherefore, an option is also to merge
the summarised data and then change the values based on the NA
elements
transform(merge(df, df_mean, by = "GENDER"),
INCOME = ifelse(is.na(INCOME.x), INCOME.y, INCOME.x))[c('GENDER', 'INCOME')]
GENDER INCOME
1 F 4.000000
2 F 4.000000
3 F 5.000000
4 F 4.333333
5 M 1.000000
6 M 2.000000
7 M 3.000000
8 M 2.000000