Search code examples
rplyrmissing-datamedian

imputing data with median by date in R


I need to replace the missing values in the field "steps" by the median of "steps" calculated over that particular day (group by "date") with NA values removed. I have already referred this thread but my NA values aren't replaced. Can somebody help me find out where am I going wrong? I would prefer using base package/data table/plyr. The dataset looks approx. like this:-

        steps      date interval
    1:    NA 2012-10-01        0
    2:    NA 2012-10-01        5
    3:    NA 2012-10-01       10
    4:    NA 2012-10-01       15
    5:    NA 2012-10-01       20
   ---                          
17564:    NA 2012-11-30     2335
17565:    NA 2012-11-30     2340
17566:    NA 2012-11-30     2345
17567:    NA 2012-11-30     2350
17568:    NA 2012-11-30     2355

The structure and summary of the dataset(activity) are as shown below

 #str(activity)  
 Classes ‘data.table’ and 'data.frame': 17568 obs. of  3 variables:
     $ steps   : int  NA NA NA NA NA NA NA NA NA NA ...
     $ date    : Date, format: "2012-10-01" "2012-10-01" "2012-10-01" ...
     $ interval: int  0 5 10 15 20 25 30 35 40 45 ...

#summary(activity)
         steps             date               interval     
     Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
     1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
     Median :  0.00   Median :2012-10-31   Median :1177.5  
     Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
     3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
     Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
     NA's   :2304     

Things I have tried:

Datatable method:

activityrepNA<-activity[,steps := ifelse(is.na(steps), median(steps, na.rm=TRUE), steps), by=date]
summary(activityrepNA)
     steps             date               interval     
 Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Median :  0.00   Median :2012-10-31   Median :1177.5  
 Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
 3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
 Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
 NA's   :2304 

Using ave

activity$steps[is.na(activity$steps)] <- with(activity, ave(steps,date, FUN = function(x) median(x, na.rm = TRUE)))[is.na(activity$steps)]
> summary(activity)
     steps             date               interval     
 Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Median :  0.00   Median :2012-10-31   Median :1177.5  
 Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
 3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
 Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
 NA's   :2304

Attempt at ddply

cleandatapls<-ddply(activity, 
+       .(as.character(date)), 
+       transform, 
+       steps=ifelse(is.na(steps), median(steps, na.rm=TRUE), steps))
> summary(cleandatapls)
 as.character(date)     steps             date               interval     
 Length:17568       Min.   :  0.00   Min.   :2012-10-01   Min.   :   0.0  
 Class :character   1st Qu.:  0.00   1st Qu.:2012-10-16   1st Qu.: 588.8  
 Mode  :character   Median :  0.00   Median :2012-10-31   Median :1177.5  
                    Mean   : 37.38   Mean   :2012-10-31   Mean   :1177.5  
                    3rd Qu.: 12.00   3rd Qu.:2012-11-15   3rd Qu.:1766.2  
                    Max.   :806.00   Max.   :2012-11-30   Max.   :2355.0  
                    NA's   :2304   

Aggregate for calculating median

whynoclean<-aggregate(activity,by=list(activity$date),FUN=median,na.rm=TRUE)
> summary(whynoclean)
    Group.1               steps        date               interval   
 Min.   :2012-10-01   Min.   :0   Min.   :2012-10-01   Min.   :1178  
 1st Qu.:2012-10-16   1st Qu.:0   1st Qu.:2012-10-16   1st Qu.:1178  
 Median :2012-10-31   Median :0   Median :2012-10-31   Median :1178  
 Mean   :2012-10-31   Mean   :0   Mean   :2012-10-31   Mean   :1178  
 3rd Qu.:2012-11-15   3rd Qu.:0   3rd Qu.:2012-11-15   3rd Qu.:1178  
 Max.   :2012-11-30   Max.   :0   Max.   :2012-11-30   Max.   :1178  
                      NA's   :8                     

EDIT output for the code using mutate

activity %>% group_by(date) %>% mutate(steps = replace(steps, is.na(steps), median(steps, na.rm = T)))
Source: local data table [17,568 x 3]

   steps       date interval
1     NA 2012-10-01        0
2     NA 2012-10-01        5
3     NA 2012-10-01       10
4     NA 2012-10-01       15
5     NA 2012-10-01       20
6     NA 2012-10-01       25
7     NA 2012-10-01       30
8     NA 2012-10-01       35
9     NA 2012-10-01       40
10    NA 2012-10-01       45
..   ...        ...      ... 

UPDATE:

Steven Beaupre helped me realize that my approach for the imputing was flawed as there were specific dates having only NA values that were causing the problem as the median of NA's is NA. Used another suggested approach.


Solution

  • Try:

    library(dplyr)
    df %>% 
      group_by(date) %>% 
      mutate(steps = ifelse(is.na(steps), median(steps, na.rm = T), steps))
    

    If for a given date, all steps are NAs, you could replace them with 0:

    df %>% 
      group_by(date) %>% 
      mutate(steps = ifelse(all(is.na(steps)), 0,
                            ifelse(is.na(steps), median(steps, na.rm = T), steps)))