Search code examples
rdata.tabledcast

why is dcast so impossible to pass a non-aggregate function?


I am using the data.table package for a table like this:

DT <- data.table(id=rep(1:100, each=50),
                 grp=rep(letters[1:4], each=1250),
                 time=rep(1:50,100),
                 outcome=rnorm(5000),
                 seconds=rep(1:500,10),
                 weights=rnorm(5000),
                 response=rep(1:200, each=25),
                 key=c("grp", "time"))

I would like to create a new (possibly rbindlisted) data table of some summary statistics from this table. I first created two intermediary tables a and b,

a <- DT[, list(mean = weighted.mean(outcome, weights), 
               median=median(outcome),seconds), by=c("grp","time")]
b <- DT[, list(mean=weighted.mean(response, seconds),
               median=median(response)), by=c("grp","time")]

and then am trying to rowbind these together across all groups but still preserve the grouping along the rows. This does not work:

  DTfinal <- data.table(DT$grp, DT$time,
   outcomemean=a$mean, responsemean=b$mean, 
   outcomemedian=a$median, responsemedian=b$median)

I don't think a merge works since a and b have different lengths. Rowbinding a and b also mixes up the different means and medians of a and b, ideally I would like a rbindlist that has some kind of suffix for each column like c(".a",".b").

Update: I get an error (since a and b have different dimension) doing

DTfinal <- rbindlist(setNames(list(a[, c("grp", "time", "mean", "median"),
                                     with = FALSE], 
                                   b[, c("grp", "time", "mean", "median"),
                                     with = FALSE]), 
                                    c("a", "b")),
                                    idcol= "id")

dcast(DTfinal, grp + time ~id, value.var = c('mean', 'median')) 

where it returns

Aggregate function missing, defaulting to 'length'

Solution

  • We can use rbindlist after placing the datasets in a list

    DTfinal <- rbindlist(list(a,b))
    dim(DTfinal)
    #[1] 400   4
    dim(a)
    #[1] 200   4
    dim(b)
    #[1] 200   4
    

    Suppose if both datasets have different number of columns, and we have a vector of column names that we need to keep

    nm1 <- intersect(names(a), names(b))
    rbindlist(list(a[, nm1, with = FALSE], b[, nm1, with = FALSE]), idcol= "id")
    

    Update

    If we need to convert to 'wide' format

    DTfinal <-  rbindlist(setNames(list(a,b), c("a", "b")), idcol= "id")
    dcast(DTfinal, grp + time ~id, value.var = c('mean', 'median'))
    #     grp time       mean_a    mean_b    median_a median_b
    #  1:   a    1   0.52171471  25.99502 -0.06558068       25
    #  2:   a    2   0.36445108  25.99010  0.13518412       25
    #  3:   a    3   0.08993721  25.98522  0.20128790       25
    #  4:   a    4 -64.04617391  25.98039  0.40999376       25
    #  5:   a    5   0.81730847  25.97561 -0.03481697       25
    # ---                                                     
    #196:   d   46   1.62818374 176.67568 -0.26695999      176
    #197:   d   47  -1.45259871 176.67340  0.14893356      176
    #198:   d   48   9.59796683 176.67114 -0.05834959      176
    #199:   d   49  -2.74285453 176.66890 -0.22094347      176
    #200:   d   50   1.22109043 176.66667 -0.08172928      176