I am new to 'R' program and currently want to deal with the missing values. Basically, I have a dataset with a few columns and there are missing values in the 'Purchase' column.
I want to impute the mean of Purchase values based on 'Master_Category' column for the missing values.
(Python code)
# generate missing Purchase values
miss_Purch_rows = dataset['Purchase'].isnull()
# Check Purchase values from the grouping by the newly created Master_Product_Category column
categ_mean = dataset.groupby(['Master_Product_Category'])['Purchase'].mean()
# Impute mean Purchase value based on Master_Product_Category column
dataset.loc[miss_Purch_rows,'Purchase'] = dataset.loc[miss_Purch_rows,'Master_Product_Category'].apply(lambda x: categ_mean.loc[x])
I am looking for a similar code in 'R-program' to impute missing values by mean and relating to another column.
Sample data of the dataset is as follows;
User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
1 1000001 P00000142 F 0-17 10 0 345 13650
2 1000001 P00004842 F 0-17 10 0 3412 13645
3 1000001 P00025442 F 0-17 10 0 129 15416
4 1000001 P00051442 F 0-17 10 0 8170 9938
5 1000001 P00051842 F 0-17 10 0 480 2849
6 1000001 P00057542 F 0-17 10 0 345 NA
7 1000001 P00058142 F 0-17 10 0 3412 11051
8 1000001 P00058242 F 0-17 10 0 3412 NA
9 1000001 P00059442 F 0-17 10 0 6816 16622
10 1000001 P00064042 F 0-17 10 0 3412 8190
I have tried ;
with(dataset, sapply(X = Purchase, INDEX = Master_Category, FUN = mean, na.rm = TRUE))
But it doesn't seem to work.
This type of per-group operation is usually easy to do via the tidyverse set of packages:
First, we read in your example data:
txt <- 'User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
1000001 P00000142 F 0-17 10 0 345 13650
1000001 P00004842 F 0-17 10 0 3412 13645
1000001 P00025442 F 0-17 10 0 129 15416
1000001 P00051442 F 0-17 10 0 8170 9938
1000001 P00051842 F 0-17 10 0 480 2849
1000001 P00057542 F 0-17 10 0 345 NA
1000001 P00058142 F 0-17 10 0 3412 11051
1000001 P00058242 F 0-17 10 0 3412 NA
1000001 P00059442 F 0-17 10 0 6816 16622
1000001 P00064042 F 0-17 10 0 3412 8190'
df <- read.table(text = txt, header = T)
Then we group by "Master_Category", and fill in any NA
values with the group mean using ifelse
inside of mutate
:
library(tidyverse)
df.new <- df %>%
group_by(Master_Category) %>%
mutate(Purchase = ifelse(is.na(Purchase), mean(Purchase, na.rm = T), Purchase))
User_ID Product_ID Gender Age Occupation Marital_Status Master_Category Purchase
<int> <fct> <lgl> <fct> <int> <int> <int> <dbl>
1 1000001 P00000142 FALSE 0-17 10 0 345 13650
2 1000001 P00004842 FALSE 0-17 10 0 3412 13645
3 1000001 P00025442 FALSE 0-17 10 0 129 15416
4 1000001 P00051442 FALSE 0-17 10 0 8170 9938
5 1000001 P00051842 FALSE 0-17 10 0 480 2849
6 1000001 P00057542 FALSE 0-17 10 0 345 13650
7 1000001 P00058142 FALSE 0-17 10 0 3412 11051
8 1000001 P00058242 FALSE 0-17 10 0 3412 10962
9 1000001 P00059442 FALSE 0-17 10 0 6816 16622
10 1000001 P00064042 FALSE 0-17 10 0 3412 8190