Search code examples
raggregatena

Get values from aggregate method into NA values from other column condition in R


I was trying to fill NA values from column 'Income' with the mean of the column based on the gender column, so if 'M' has a NA value, insert one from the mean of all 'M' people (except NA values):

    GENDER|INCOME
         M|     1
         M|     2
         M|     3
         M|    NA
         F|     4
         F|     4
         F|     5
         F|    NA


    GENDER|INCOME
         M|     1
         M|     2
         M|     3
         M|     2
         F|     4
         F|     4
         F|     5
         F|  4.33

I tried the aggregate method but it gives you a list that, if I try to input in my dataset, it gives me an error in order to write a CSV. It works perfectly to get the values, but I don't know how to input these values but manually. This is my code so far:

df$INCOME[is.na(df$INCOME)] <- aggregate(.~df$GENDER, df, FUN = mean, na.rm = TRUE)

Thank you for any help!


Solution

  • If we want the mean to fill the missing values (NA) for each 'GENDER', then use na.aggregate from zoo which does this automatically (FUN by default is mean)

    library(zoo)
    library(dplyr)
    df <- df %>% 
      group_by(GENDER) %>% 
      mutate(INCOME = na.aggregate(INCOME)) %>%
      ungroup
    

    Or another option is to replace the NA elements with the mean

    df %>%
       group_by(GENDER) %>%
       mutate(INCOME = replace(INCOME, is.na(INCOME), 
            mean(INCOME, na.rm = TRUE))) %>%
     ungroup
    

    aggregate returns a data.frame of summarised value columns and the group column. Thus, assigning on the NA elements with the whole data.frame may not work i.e. we need to extract the column

    df_mean <- aggregate(INCOME ~ GENDER, df, FUN = mean,
       na.rm = TRUE, na.action = NULL)
    df_mean
       GENDER   INCOME
    1      F 4.333333
    2      M 2.000000
    df_mean$INCOME
    [1] 4.333333 2.000000
    df$INCOME[is.na(df$INCOME)] <- df_mean$INCOME
    

    Extracting the INCOME column and assign could be buggy approach because

    1. the 'GENDER' grouping may be in a different order than in the original data
    2. the number of NA for each 'GENDER' could be potentially different

    Therefore, an option is also to merge the summarised data and then change the values based on the NA elements

    transform(merge(df, df_mean, by = "GENDER"), 
       INCOME = ifelse(is.na(INCOME.x), INCOME.y, INCOME.x))[c('GENDER', 'INCOME')]
      GENDER   INCOME
    1      F 4.000000
    2      F 4.000000
    3      F 5.000000
    4      F 4.333333
    5      M 1.000000
    6      M 2.000000
    7      M 3.000000
    8      M 2.000000