I have two data frames, one containing the raw data (and lots of NAs) and one containing the means of the raw data at different intervals.
I would like to replace the NAs with the means at those intervals, while retaining the non-NAs.
I have googled a lot on "impute missing data r", or "replace missing data r", but have yet to find any solution that seems to fit, they all either seem to replace data with 0:s, or use a way to complex method for this assignment, like using the MICE package.
Code example: This is the head of the first DF, with the raw data. As you can see, they are all NA for the first day.
steps date interval
1 NA 2012-10-01 0
2 NA 2012-10-01 5
3 NA 2012-10-01 10
4 NA 2012-10-01 15
5 NA 2012-10-01 20
6 NA 2012-10-01 25
...
The second data frame head contains the mean steps per interval, like this:
steps interval
1 1.72 0
2 0.340 5
3 0.132 10
4 0.151 15
5 0.0755 20
6 2.09 25
...
Now, what I am looking for is to be able to fill the NAs with the mean steps for the relevant interval, so it looks like this:
steps date interval
1 1.72 2012-10-01 0
2 0.340 2012-10-01 5
3 0.132 2012-10-01 10
4 0.151 2012-10-01 15
5 0.0755 2012-10-01 20
6 2.09 2012-10-01 25
...
Any tips, or links for resources I have missed? As this is a course assignment and I mostly wanna learn, any help without doing the assignment for me would be much appreciated! =)
edit: Also, since this is my first question on Stack Overflow, any comments on how to improve my question-making is also appreciated!
There are many ways to do that in R. For example
# generate dataframe with some interval vaulues
df1 <- data.frame(interval= rep(seq(0, 25, 5), 5))
# add a steps column
df1$steps <- 1:nrow(df)
# copy the dataframe
df2 <- df1
# replace some steps values with missings in df1
df1$steps[c(1,2,5, 14)] <- NA
# sapply goes thru every unique interval...
sapply(df1$interval, function(interval_i){
# replace missing steps of interval_i in df1 with the steps mean of interval_i of df2
df1$steps[is.na(df1$steps) & df1$interval == interval_i] <<- mean(df2$steps[df2$interval == interval_i], na.rm= TRUE)
# you must use <<- not <- to assign it to df outside of sapply
})