What I want to do is calculate the mean when I have more than one observation with the same ID, and the median when there is only one observation with the same ID.
car <- c("Mazda", "Toyota Corolla", "Duster",
"Cadillac", "Maserati", "Mazda", "Ford", "Mazda", "Toyota Corolla")
x <- c(3,4,2,3,NA,4,2,NA,3)
x_2 <- c(3,4,2,3,NA,4,2,NA,3)
cars <- data.frame(car,x,x_2)
What I want is to take the ID of my data frame that has missing values, In this case, only Maserati and Mazda have missing values, I want that R takes the missing values of the Maserati and replace them by the mean between all Maserati in the sample, if this condition is not met I want to take the average of the column, while for Mazda I want that R does the same since there is more than one Mazda I want to take the average between all Mazda car and replace the missing values with the mean between the Mazda cars.
This is what I did
library(dplyr)
library(zoo)
cars <- cars %>% group_by(car) %>% mutate(x = na.aggregate(x))
cars$x <- ifelse(is.na(cars$x)==TRUE, median(x),x)
The third line creates the mean of the Mazda, but the second line does not create the median of the column and assigns to the cars with missing values. In this case the Maserati. Also, is there a way that I can do this for all my data frame without run this line of code for each variable.
This is what I need:
First replace NA
with mean
of each car
. If NA
still remains replace it with median
of the column.
library(dplyr)
cars %>%
group_by(car) %>%
mutate(across(c(x, x_2), ~replace(., is.na(.), mean(., na.rm = TRUE)))) %>%
ungroup() %>%
mutate(across(c(x, x_2), ~replace(., is.na(.), median(., na.rm = TRUE))))
# car x x_2
# <chr> <dbl> <dbl>
#1 Mazda 3 3
#2 Toyota Corolla 4 4
#3 Duster 2 2
#4 Cadillac 3 3
#5 Maserati 3 3
#6 Mazda 4 4
#7 Ford 2 2
#8 Mazda 3.5 3.5
#9 Toyota Corolla 3 3