Search code examples
rdata.tablecustomizationaggregate-functionsdcast

Customise the aggregate function inside dcast based on the max value of a column in data.table?


I've got a data.table that i'd like to dcast based on three columns (V1, V2, V3). there are, however, some duplicates in V3 and I need an aggregate function that looks at a fourth column V4 and decides for the value of V3 based on maximum value of V4. I'd like to do this without having to aggregate DT separately prior to dcasting. can this aggregation be done in aggregate function of dcast or do I need to aggregate the table separately first?

Here is my data.table DT:

> DT <- data.table(V1 = c('a','a','a','b','b','c')
                 , V2 = c(1,2,1,1,2,1)
                 , V3 = c('st', 'cc', 'B', 'st','st','cc')
                 , V4 = c(0,0,1,0,1,1))
> DT
   V1 V2 V3 V4
1:  a  1 st  0
2:  a  2 cc  0
3:  a  1  B  1 ## --> i want this row to be picked in dcast when V1 = a and V2 = 1 because V4 is largest
4:  b  1 st  0
5:  b  2 st  1
6:  c  1 cc  1

and the dcast function could look something like this:

> dcast(DT
          , V1 ~ V2
          , value.var = "V3"
         #, fun.aggregate = V3[max.which(V4)] ## ?!?!?!??!
        )

My desired output is:

> desired
   V1  1    2
1:  a  B   cc  
2:  b st   st
3:  c cc <NA>

Please note that aggregating DT before dcasting to get rid of the duplicates will solve the issue. I'm just wondering if dcasting can be done with the duplicates.


Solution

  • Here is one option where you take the relevent subset before dcasting:

    DT[order(V4, decreasing = TRUE)
       ][, dcast(unique(.SD, by = c("V1", "V2")), V1 ~ V2, value.var = "V3")] 
    
    #    V1  1    2
    # 1:  a  B   cc
    # 2:  b st   st
    # 3:  c cc <NA>
    

    Alternatively order and use a custom function in dcast():

    dcast(
      DT[order(V4, decreasing = TRUE)], 
      V1 ~ V2, 
      value.var = "V3", 
      fun.aggregate = function(x) x[1]
    )