I would like to impute NA's in a dataframe with means of observed data in each country. In other words, while dealing with NAs, the values in the specific country should be taken into consideration. For instance;
Date Country Battles Riots
March 2018 Afghanistan 380 NA
March 2018 Yemen 88 5
March 2018 Mali 45 NA
April 2018 Afghanistan 350 NA
April 2018 Yemen NA 66
April 2018 Mali 67 NA
May 2018 Afghanistan NA 7
May 2018 Yemen NA NA
May 2018 Mali NA 6
I have used the following code, but obviously it calculates the means without taking the country specific information.
for(i in 6:ncol(my_data)) {
my_data[ , i][is.na(my_data[ , i])] <- mean(my_data[ , i], na.rm = TRUE)
}
Many thanks in advance.
You could use:
library(dplyr)
library(tidyr)
df %>%
group_by(Country) %>%
mutate(across(c(Battles, Riots), ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
ungroup()
which returns
Date Country Battles Riots
<chr> <chr> <dbl> <dbl>
1 March 2018 Afghanistan 380 7
2 March 2018 Yemen 88 5
3 March 2018 Mali 45 6
4 April 2018 Afghanistan 350 7
5 April 2018 Yemen 88 66
6 April 2018 Mali 67 6
7 May 2018 Afghanistan 365 7
8 May 2018 Yemen 88 35.5
9 May 2018 Mali 56 6
structure(list(Date = c("March 2018", "March 2018", "March 2018",
"April 2018", "April 2018", "April 2018", "May 2018", "May 2018",
"May 2018"), Country = c("Afghanistan", "Yemen", "Mali", "Afghanistan",
"Yemen", "Mali", "Afghanistan", "Yemen", "Mali"), Battles = c(380,
88, 45, 350, NA, 67, NA, NA, NA), Riots = c(NA, 5, NA, NA, 66,
NA, 7, NA, 6)), row.names = c(NA, -9L), class = c("tbl_df", "tbl",
"data.frame"))