Search code examples
rdata.tablecrosstab

How to crosstabulate the missings with data.table


Say we have this toy example:

prueba <- data.table(aa=1:7,bb=c(1,2,NA, NA, 3,1,1),
cc=c(1,2,NA, NA, 3,1,1) , YEAR=c(1,1,1,2,2,2,2))

   aa bb cc YEAR
1:  1  1  1    1
2:  2  2  2    1
3:  3 NA NA    1
4:  4 NA NA    2
5:  5  3  3    2
6:  6  1  1    2
7:  7  1  1    2

I want to create a table with the values of something by YEAR. In this simple example I will just ask for the table that says how many missing and non-missing I have.

This is an ugly way to do it, specifying everything by hand:

prueba[,.(sum(is.na(.SD)),sum(!is.na(.SD))), by=YEAR]

Though it doesn't label automatically the new columns we see it says I have 2 missings and 7 non-missing values for year 1, and ...

   YEAR V1 V2
1:    1  2  7
2:    2  2 10

It works but what I would really like is to be able to use table() or some data.table equivalent command instead of specifying by hand every term. That would be much more efficient if I have many of them or if we don't know them beforehand.

I've tried with:

prueba[,table(is.na(.SD)), by=YEAR]

but it doesn't work, I get this:

   YEAR V1
1:    1  7
2:    1  2
3:    2 10
4:    2  2

How can I get the same format than above?
I've unluckily tried by using as.datable, unlist, lapply, and other things. I think some people use dcast but I don't know how to use it here.
Is there a simple way to do it? My real table is very large. Is it better to use the names of the columns instead of .SD?


Solution

  • You can convert the table to a list if you want it as two separate columns

    prueba[, as.list(table(is.na(.SD))), by=YEAR]
    
    #    YEAR FALSE TRUE
    # 1:    1     7    2
    # 2:    2    10    2
    

    I suggest not using TRUE and FALSE as column names though.

    prueba[, setNames(as.list(table(is.na(.SD))), c('notNA', 'isNA'))
           , by = YEAR]
    
    #    YEAR notNA isNA
    # 1:    1     7    2
    # 2:    2    10    2
    

    Another option is to add a new column and then dcast

    na_summ <- prueba[, table(is.na(.SD)), by = YEAR] 
    na_summ[, vname := c('notNA', 'isNA'), YEAR] 
    dcast(na_summ, YEAR ~ vname, value.var = 'V1')
    
    #    YEAR isNA notNA
    # 1:    1    2     7
    # 2:    2    2    10