Search code examples
rcountaggregateplyrr-factor

Count all observed factor levels, also those that aren't observed


We have the DF

df <- data.frame(group=as.factor(rep(c("UP","DOWN"),6)),variables=(rep(c("sex","smoke","sport"),each=4))
             ,values=as.factor(c(1,1,1,0  ,1,1,0,0, 1,1,1,1)))

   group variables values
1     UP       sex      1
2   DOWN       sex      1
3     UP       sex      1
4   DOWN       sex      0
5     UP     smoke      1
6   DOWN     smoke      1
7     UP     smoke      0
8   DOWN     smoke      0
9     UP     sport      1
10  DOWN     sport      1
11    UP     sport      1
12  DOWN     sport      1
> 

Now I want to know all counts of all levels

library(plyr)

This command does almost perfectly what I want

count(df, c("variables", "group", "values"))

 variables group values freq
1      sex  DOWN     0    1
2      sex  DOWN     1    1
3      sex    UP     1    2
4    smoke  DOWN     0    1
5    smoke  DOWN     1    1
6    smoke    UP     0    1
7    smoke    UP     1    1
8    sport  DOWN     1    2
9    sport    UP     1    2

I also want to count the factor levels that arent observed. Like I did per hand in the following output.

 variables group values freq
1      sex  DOWN     0    1
2      sex  DOWN     1    1
3      sex    UP     0    0  <-- 
4      sex    UP     1    2
5    smoke  DOWN     0    1
6    smoke  DOWN     1    1
7    smoke    UP     0    1  
8    smoke    UP     1    1
9    sport  DOWN     0    0  <--
10   sport  DOWN     1    2
11   sport    UP     0    0  <--
12   sport    UP     1    2

How can I achieve the above output?


Solution

  • You can do this also with data.table with less lines of code:

    library(data.table)
    dt <- setDT(df)
    cj <- CJ(dt$variables, dt$group, dt$values, unique = TRUE)
    dt[, .N, keyby = c("variables", "group", "values")][cj][is.na(N), N := 0]
    
    print(dt)    
        variables group values N
     1:       sex  DOWN      0 1
     2:       sex  DOWN      1 1
     3:       sex    UP      0 0
     4:       sex    UP      1 2
     5:     smoke  DOWN      0 1
     6:     smoke  DOWN      1 1
     7:     smoke    UP      0 1
     8:     smoke    UP      1 1
     9:     sport  DOWN      0 0
    10:     sport  DOWN      1 2
    11:     sport    UP      0 0
    12:     sport    UP      1 2
    

    Explanation

    setDT() converts a data.frame to a data.table by reference, ie, without copying.

    CJ() is a cross join. It forms a data.table from the cross product of the vectors. Thus, it's the data.table version of expand.grid. The parameter unique = TRUE is a convenient alternative to wrapping each argument in level() or unique().

    The counting by groups is done with dt[, .N, keyby = c("variables", "group", "values")]:

       variables group values N
    1:       sex  DOWN      0 1
    2:       sex  DOWN      1 1
    3:       sex    UP      1 2
    4:     smoke  DOWN      0 1
    5:     smoke  DOWN      1 1
    6:     smoke    UP      0 1
    7:     smoke    UP      1 1
    8:     sport  DOWN      1 2
    9:     sport    UP      1 2
    

    Now, dt[, .N, keyby = c("variables", "group", "values")][cj] (right) joins the CJ() result with all possible combinations.

    Finally, [is.na(N), N := 0] replaces all NAs in column N by 0.