Search code examples
rmeanmissing-datamedian

How calculate the mean by ID when there is more than one individual and the median when there is only one for missing values in R?


What I want to do is calculate the mean when I have more than one observation with the same ID, and the median when there is only one observation with the same ID.

car <- c("Mazda", "Toyota Corolla", "Duster",
         "Cadillac", "Maserati", "Mazda", "Ford", "Mazda", "Toyota Corolla")
x <- c(3,4,2,3,NA,4,2,NA,3)
x_2 <- c(3,4,2,3,NA,4,2,NA,3)
cars <- data.frame(car,x,x_2)

What I want is to take the ID of my data frame that has missing values, In this case, only Maserati and Mazda have missing values, I want that R takes the missing values of the Maserati and replace them by the mean between all Maserati in the sample, if this condition is not met I want to take the average of the column, while for Mazda I want that R does the same since there is more than one Mazda I want to take the average between all Mazda car and replace the missing values with the mean between the Mazda cars.

This is what I did

library(dplyr)
library(zoo)
cars <- cars %>% group_by(car) %>% mutate(x = na.aggregate(x))
cars$x <- ifelse(is.na(cars$x)==TRUE, median(x),x)

The third line creates the mean of the Mazda, but the second line does not create the median of the column and assigns to the cars with missing values. In this case the Maserati. Also, is there a way that I can do this for all my data frame without run this line of code for each variable.

This is what I need:

  1. Calculate the mean by ID and use it as a replacement for missing values: Example: Mazda missing should be replaced by the mean between the two Mazda in the sample
  2. If there are no cars that match with the missing values substitute the missing values with the median of the column: Example there is only one Maserati in my sample therefore we take the median value of the column.
  3. I want to do this for all my data frame and store into the same variables: Example do this for x and x_2 in the same steps and store the new values into the same variables x and x_2

Solution

  • First replace NA with mean of each car. If NA still remains replace it with median of the column.

    library(dplyr)
    
    cars %>%
      group_by(car) %>%
      mutate(across(c(x, x_2), ~replace(., is.na(.), mean(., na.rm = TRUE)))) %>%
      ungroup() %>%
      mutate(across(c(x, x_2), ~replace(., is.na(.), median(., na.rm = TRUE))))
    
    #   car             x   x_2
    #  <chr>          <dbl> <dbl>
    #1 Mazda            3     3  
    #2 Toyota Corolla   4     4  
    #3 Duster           2     2  
    #4 Cadillac         3     3  
    #5 Maserati         3     3  
    #6 Mazda            4     4  
    #7 Ford             2     2  
    #8 Mazda            3.5   3.5
    #9 Toyota Corolla   3     3