Search code examples
rdata.tabledcast

Using proportion as the aggregate function in Data.Table's dcast


When creating a pivot table using data.table, I am using the dcast function:

dcast(my_data, var1 ~ var2, length)

This gives a table with rows as var1 lables and column as var2 labels and value as count of cells common to particular row and column.

But instead of length I want to calculate the proportion and put it as the value, i.e. {count of cells common to particular row and column} divided by {count of all cells in the column i.e. a particular level of var2}

I have searched and couldn't able to implement it. Any help would be appreciated.


Solution

  • There is a relatively simple solution, but it requires a second step after the dcast().

    First, this is the data I am working on:

    library(data.table)
    
    set.seed(666)
    my_data <- data.table(var1 = sample(letters[1:3], 10, TRUE),
                          var2 = sample(letters[4:6], 10, TRUE))
    
        var1 var2
     1:    c    f
     2:    a    d
     3:    c    d
     4:    a    d
     5:    b    d
     6:    c    f
     7:    c    d
     8:    b    f
     9:    a    e
    10:    a    e
    

    After the dcast

    my_data_dcast <- dcast(my_data, var1 ~ var2, length)
    

    the data looks like this:

       var1 d e f
    1:    a 2 2 0
    2:    b 1 0 1
    3:    c 2 0 2
    

    You can then simply go through all columns and divide each element in a column by the sum of all values in a column.

    Select the columns to transform:

    cols <- unique(my_data$var2)
    

    Go through columns using lapply() on the subset of columns specified in .SDcols and override the values of all cols:

    my_data_dcast[, (cols) := (lapply(.SD, function(col) col / sum(col))),
                  .SDcols = cols]
    

    The final result is this:

       var1   d e         f
    1:    a 0.4 1 0.0000000
    2:    b 0.2 0 0.3333333
    3:    c 0.4 0 0.6666667