Search code examples
raggregatenacbind

Using Aggregate and Cbind- NA issue?


I am using cbind to find the mean of 3 different columns. However I get different answers for the means when I do:

  DFNEW <- aggregate(cbind(X1, X2, X3)~Y, DF, FUN=mean)

vs

  DFNEW <- aggregate(cbind(X1, X2)~Y, DF, FUN=mean)

The means of X1 and X2 are different when I run command 1 and when I run command 2. X1, X2, and X3 all have different numbers of NA arguments- is that the reason? Part of this may also be that I'm not entirely sure what cbind is doing in this case.


Solution

  • I guess the reason why you are getting different result is because by default, na.action=na.omit for the formula interface. So, the rows with NAs are omitted and not used for the calculation of mean. When we use different combinations of columns, different rows could be deleted based on the occurrence of NA. By specifying na.action=NULL, the rows will not get deleted and we can remove the NA values while calculating the mean by using the argument na.rm=TRUE in the mean function.

      aggregate(cbind(X1, X2)~Y, DF, FUN=mean, na.rm=TRUE, na.action=NULL)
       #Y       X1       X2
       #1 A 3.625000 6.375000
       #2 B 3.000000 6.750000
       #3 C 4.142857 7.166667
    
      aggregate(cbind(X1, X2, X3)~Y, DF, FUN=mean, na.rm=TRUE, na.action=NULL)
      #  Y       X1       X2       X3
      #1 A 3.625000 6.375000 2.875000
      #2 B 3.000000 6.750000 1.500000
      #3 C 4.142857 7.166667 1.333333
    

    The results we get from the above will be the same below i.e. without using the formula interface

    aggregate(DF[,c("X1", "X2")], list(Y=DF[,"Y"]), FUN= mean, na.rm=TRUE)
    #  Y       X1       X2
    # 1 A 3.625000 6.375000
    # 2 B 3.000000 6.750000
    # 3 C 4.142857 7.166667
    
    
    aggregate(DF[,c("X1", "X2", "X3")], list(Y=DF[,"Y"]), FUN= mean, na.rm=TRUE)
    #  Y       X1       X2       X3
    #1 A 3.625000 6.375000 2.875000
    #2 B 3.000000 6.750000 1.500000
    #3 C 4.142857 7.166667 1.333333
    

    If you want some alternatives, you could use dplyr

     library(dplyr)
      DF %>% 
         group_by(Y) %>% 
         summarise_each(funs(mean=mean(., na.rm=TRUE)))
      # Source: local data frame [3 x 4]
    
      #  Y       X1       X2       X3
      #1 A 3.625000 6.375000 2.875000
      #2 B 3.000000 6.750000 1.500000
      #3 C 4.142857 7.166667 1.333333
    
       DF %>% 
          group_by(Y) %>%
          summarise_each(funs(mean=mean(., na.rm=TRUE)), X1, X2)
       #Source: local data frame [3 x 3]
    
       #  Y       X1       X2
       #1 A 3.625000 6.375000
       #2 B 3.000000 6.750000
       #3 C 4.142857 7.166667
    

    data

    set.seed(42)
    DF <- data.frame(X1=sample(c(NA, 1:5), 25, replace=TRUE), 
        X2= sample(c(NA, 1:10), 25, replace=TRUE), X3= sample(c(NA,0:5), 25, 
        replace=TRUE), Y=sample(LETTERS[1:3], 25, replace=TRUE))