Search code examples
rdatesumapplylubridate

Why apply sum by column and colSums differs when dataframe contains date format?


In the example bellow, why A is not equal to B ? The problem is related to the Date column since AA=BB, but I can't understand why...

library(lubridate)
MM=data.frame(ID=1:3,Date=ymd(c("2019-11-07","2019-11-07","2019-11-13")),X=c(-1,1,1),Y=c(1,-1,-1))
A=apply(MM,2,function(x) sum(x>0))
B=colSums(MM>0)
A
# ID Date    X    Y
# 3    3    0    0 
B
# ID Date    X    Y 
# 3    3    2    1
AA=apply(MM[,-2],2,function(x) sum(x>0))
BB=colSums(MM[,-2]>0)
AA
# ID  X  Y 
# 3  2  1 
BB
# ID  X  Y 
# 3  2  1 

R version 4.0.3 (2020-10-10)


Solution

  • It's because apply, before applying the function to each column, first of all converts the data.frame into a matrix.

    Therefore:

    > as.matrix(MM)
    
         ID  Date         X    Y   
    [1,] "1" "2019-11-07" "-1" " 1"
    [2,] "2" "2019-11-07" " 1" "-1"
    [3,] "3" "2019-11-13" " 1" "-1"
    

    Everything is converted to character.

    Specifically, you can see that:

    > " 1" > 0
    [1] FALSE
    > "1" > 0
    [1] TRUE
    

    This is because 0 is converted to char.

    "0" comes before " 1" alphabetically because the character 0 gets compared to the blank instead of the character "1".

    The order of strings are decided by comparing character by character 1 by 1. (it's the same reason why "10" > "9" is FALSE, because 1 alphabetically is before 9).

    > sort(c("0", " 1"))
    [1] " 1" "0" 
    > sort(c("0", "1"))
    [1] "0" "1"
    

    As a consequence:

    > as.matrix(MM)>0
           ID Date     X     Y
    [1,] TRUE TRUE FALSE FALSE
    [2,] TRUE TRUE FALSE FALSE
    [3,] TRUE TRUE FALSE FALSE
    

    colSums doesn't do that.


    You can get the same result you would get with colSums if you do:

    > apply(MM>0, 2, sum)
      ID Date    X    Y 
       3    3    2    1