I want to impute the missing values in val for all ctry in category 1 (cat == 1) with the respective country means.
Data example
set.seed(654)
df1 <- data.frame(
  year = rep(2000:2005, each = 5),
  ctry = rep(LETTERS[1:5], 6),
  val = rnorm(30)
)
df1$cat <- ifelse(df1$ctry %in% c("A", "B"), 1, 0)
df1[sample(nrow(df1), 12), "val"] <- NA
> head(df1)
year ctry val cat
1 2000 A -0.76031762 1
2 2000 B -0.38970450 1
3 2000 C 1.68962523 0
4 2000 D NA 0
5 2000 E 0.09530146 0
6 2001 A NA 1
First, I get the names of the ctry in category 1 and compute their means.
cat1 <- as.character(sort(unique(
  df1$ctry[!is.na(df1$val) & df1$cat == 1])))
cat1 <- sapply(cat1, function(x) mean(df1$val[df1$ctry == x], na.rm=TRUE))
> cat1
A B
0.4372003 0.4792314
Now I succeed in manually imputing country by country:
df2 <- df1
df2$val[df2$ctry %in% names(cat1)[1] & is.na(df2$val)] <- cat1[1]
> head(df2)
year ctry val cat
1 2000 A -0.76031762 1
2 2000 B -0.38970450 1
3 2000 C 1.68962523 0
4 2000 D NA 0
5 2000 E 0.09530146 0
6 2001 A -0.49758245 1
But for some reason I can't get this sapply() to work so that the imputation is done automatically:
> sapply(seq_along(cat1),
+ function(x) df2$val[df2$ctry %in% names(cat1)[x] & is.na(df2$val)] <- cat1[x])
A B
-0.4975825 -0.6139364
The expected output is the whole data frame, with the missing values of the countries in category 1 imputed by their respective country means.
In Base R:
set.seed(654)
df1 <- data.frame(
  year = rep(2000:2005, each = 5),
  ctry = rep(LETTERS[1:5], 6),
  val = rnorm(30)
)
df1$cat <- ifelse(df1$ctry %in% c("A", "B"), 1, 0)
df1[sample(nrow(df1), 12), "val"] <- NA
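# per-country means, ignoring NAs (a vector named by ctry)
my_means <- tapply(df1$val, df1$ctry, mean, na.rm = TRUE)
# replace each NA with the mean of its own country
# (note: this fills NAs for every country, not only those with cat == 1)
df1$val <- ifelse(is.na(df1$val), my_means[df1$ctry], df1$val)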
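The question asks for imputation only in the countries with cat == 1, and it also asks why the sapply() attempt changes nothing. The reason for the latter is that `<-` inside the anonymous function modifies a copy of df2 that is local to that function call; the function then returns the assigned value, which is what sapply() prints. Below is a minimal sketch of two ways around this, assuming the example data from the question. The names ctry_means, fill, cat1_ctry and df3 are introduced here for illustration only.

```r
# Rebuild the example data from the question
set.seed(654)
df1 <- data.frame(
  year = rep(2000:2005, each = 5),
  ctry = rep(LETTERS[1:5], 6),
  val  = rnorm(30)
)
df1$cat <- ifelse(df1$ctry %in% c("A", "B"), 1, 0)
df1[sample(nrow(df1), 12), "val"] <- NA

# Variant 1: vectorised.
# ave() returns a vector as long as df1$val in which every row holds
# the mean of its own country (NAs ignored when computing the mean).
ctry_means <- ave(df1$val, df1$ctry,
                  FUN = function(v) mean(v, na.rm = TRUE))

# Only touch rows that are missing AND belong to category 1
df2 <- df1
fill <- is.na(df2$val) & df2$cat == 1
df2$val[fill] <- ctry_means[fill]

# Variant 2: the country-by-country idea from the question, written
# as a for loop. A for loop runs in the calling environment, so the
# assignment really modifies df3 (unlike the function passed to sapply()).
cat1_ctry <- sort(unique(as.character(df1$ctry[df1$cat == 1])))
cat1 <- tapply(df1$val, df1$ctry, mean, na.rm = TRUE)[cat1_ctry]

df3 <- df1
for (nm in names(cat1)) {
  df3$val[df3$ctry == nm & is.na(df3$val)] <- cat1[[nm]]
}
```

Both variants leave the rows with cat == 0 untouched, so their NAs remain. The vectorised variant avoids the loop entirely; the loop is closer to the original attempt and may be easier to follow.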