Search code examples
rnamissing-dataimputation

How to replace NAs by group mean?


I have two data frames, one containing the raw data (and lots of NAs) and one containing the means of the raw data at different intervals.

I would like to replace the NAs with the means at those intervals, while retaining the non-NAs.

I have googled a lot on "impute missing data r", or "replace missing data r", but have yet to find any solution that seems to fit, they all either seem to replace data with 0:s, or use a way to complex method for this assignment, like using the MICE package.

Code example: This is the head of the first DF, with the raw data. As you can see, they are all NA for the first day.

   steps       date interval
1    NA 2012-10-01        0
2    NA 2012-10-01        5
3    NA 2012-10-01       10
4    NA 2012-10-01       15
5    NA 2012-10-01       20
6    NA 2012-10-01       25
...

The second data frame head contains the mean steps per interval, like this:

  steps interval
1 1.72          0
2 0.340         5
3 0.132        10
4 0.151        15
5 0.0755       20
6 2.09         25
...

Now, what I am looking for is to be able to fill the NAs with the mean steps for the relevant interval, so it looks like this:

   steps       date interval
1  1.72  2012-10-01        0
2 0.340  2012-10-01        5
3 0.132  2012-10-01       10
4 0.151  2012-10-01       15
5 0.0755 2012-10-01       20
6 2.09   2012-10-01       25
...

Any tips, or links for resources I have missed? As this is a course assignment and I mostly wanna learn, any help without doing the assignment for me would be much appreciated! =)

edit: Also, since this is my first question on Stack Overflow, any comments on how to improve my question-making is also appreciated!


Solution

  • There are many ways to do that in R. For example

    # generate dataframe with some interval vaulues
    df1 <- data.frame(interval= rep(seq(0, 25, 5), 5))
    # add a steps column
    df1$steps <- 1:nrow(df)
    # copy the dataframe
    df2 <- df1
    # replace some steps values with missings in df1
    df1$steps[c(1,2,5, 14)] <- NA
    
    # sapply goes thru every unique interval...
    sapply(df1$interval, function(interval_i){
      # replace missing steps of interval_i in df1 with the steps mean of interval_i of df2
      df1$steps[is.na(df1$steps) & df1$interval == interval_i] <<- mean(df2$steps[df2$interval == interval_i], na.rm= TRUE)
    # you must use <<- not <- to assign it to df outside of sapply
    })