Search code examples
rmeanimputation

NAs in a data frame split by country in R


I would like to impute NA's in a dataframe with means of observed data in each country. In other words, while dealing with NAs, the values in the specific country should be taken into consideration. For instance;

Date                Country        Battles       Riots
March 2018          Afghanistan    380            NA
March 2018          Yemen          88             5
March 2018          Mali           45             NA
April 2018          Afghanistan    350            NA
April 2018          Yemen          NA             66
April 2018          Mali           67             NA
May 2018            Afghanistan    NA             7
May 2018            Yemen          NA             NA
May 2018            Mali           NA             6

I have used the following code, but obviously it calculates the means without taking the country specific information.

for(i in 6:ncol(my_data)) {
 my_data[ , i][is.na(my_data[ , i])] <- mean(my_data[ , i], na.rm = TRUE)
 }

Many thanks in advance.


Solution

  • You could use:

    library(dplyr)
    library(tidyr)
    
    df %>% 
      group_by(Country) %>% 
      mutate(across(c(Battles, Riots), ~ replace_na(.x, mean(.x, na.rm = TRUE)))) %>%
      ungroup()
    

    which returns

      Date       Country     Battles Riots
      <chr>      <chr>         <dbl> <dbl>
    1 March 2018 Afghanistan     380   7  
    2 March 2018 Yemen            88   5  
    3 March 2018 Mali             45   6  
    4 April 2018 Afghanistan     350   7  
    5 April 2018 Yemen            88  66  
    6 April 2018 Mali             67   6  
    7 May 2018   Afghanistan     365   7  
    8 May 2018   Yemen            88  35.5
    9 May 2018   Mali             56   6  
    

    Data

    structure(list(Date = c("March 2018", "March 2018", "March 2018", 
    "April 2018", "April 2018", "April 2018", "May 2018", "May 2018", 
    "May 2018"), Country = c("Afghanistan", "Yemen", "Mali", "Afghanistan", 
    "Yemen", "Mali", "Afghanistan", "Yemen", "Mali"), Battles = c(380, 
    88, 45, 350, NA, 67, NA, NA, NA), Riots = c(NA, 5, NA, NA, 66, 
    NA, 7, NA, 6)), row.names = c(NA, -9L), class = c("tbl_df", "tbl", 
    "data.frame"))