Search code examples
rdata.tablepivotsubtotal

R: pivoting & subtotals in data.table?


Pivoting and subtotals are common auxiliary steps in spreadsheets and SQL.

Assume a data.table with the fields date, myCategory, revenue. Assume that you want to know the proportion of day revenue of all revenue and the proportion of day revenue within different subgroup such that

 b[,{
    #First auxiliary variable of all revenue
    totalRev = sum(revenue)                     #SUBGROUP OF ALL REV

    #Second auxiliary variable of revenue by date, syntax wrong! How to do this?
    {totalRev_date=sum(revenue), by=list(date)} #DIFFERENT SUBGROUP, by DATE's rev

    #Within the subgroup by date and myCategory, we will use 1st&2nd auxiliary vars
    .SD[,.(Revenue_prop_of_TOT=revenue/totalRev,
          ,Revenue_prop_of_DAY=revenue/totalRev_date)    ,by=list(myCategory,date)]
    },]

where we need to compute the auxiliary sums, all revenue of specific day and all revenue of whole history.

The end result should look like this:

date            myCategory       Revenue_prop_of_TOT         Revenue_prop_of_DAY
2019-01-01      Cat1             0.002                       0.2
...

where you see that the auxiliary variables are only help functions.

How can you pivot and compute subtotals within R data.table?


Solution

  • Another option using data.table::cube:

    cb <- cube(DT, sum(value), by=c("date","category"), id=TRUE)
    
    cb[grouping==0L, .(date, category,
    
        PropByDate = V1 / cb[grouping==1L][.SD, on="date", x.V1],
    
        PropByCategory = V1 / cb[grouping==2L][.SD, on="category", x.V1],
    
        PropByTotal = V1 / cb[grouping==3L, V1]
    )]
    

    output:

       date category PropByDate PropByCategory PropByTotal
    1:    1        1  0.3333333      0.2500000         0.1
    2:    1        2  0.6666667      0.3333333         0.2
    3:    2        1  0.4285714      0.7500000         0.3
    4:    2        2  0.5714286      0.6666667         0.4
    

    data:

    DT <- data.table(date=c(1, 1, 2, 2), category=c(1, 2, 1, 2), value=1:4)
    
    #   date category value
    #1:    1        1     1
    #2:    1        2     2
    #3:    2        1     3
    #4:    2        2     4