Search code examples
rtime-seriesdata.tablerowwise

row-wise first/last occurrences from column series in data.table


As a follow-up to this oldie but goodie: efficient row-wise operations on a data.table

I have some data that (unfortunately) look like:

library('data.table')
set.seed(1234)
m <- 5
n <- 7
rb <- function() runif(m,1000,2000) * rbinom(m,1,0.5)   
series_col_nms <- paste0('YearNo',1:n)    
rev <- data.table(cust_id = paste0('CustNo',1:m), 
                other_stuff = sample(letters,m, replace=TRUE))
for(col in series_col_nms){
  set(rev, j=col, value=rb())
}
setkey(rev, cust_id)

One row per customer with various columns including yearly revenue for year 1, 2, ...

I would like to get the year indices for the first and last years with any revenue from each customer.

I can produce the desired result, but with a somewhat hacky join:

years_active <- rev[, which(.SD>0), .SDcols = series_col_nms, 
                    keyby=cust_id][, .(min_year_active = min(V1),
                                       max_year_active = max(V1)), keyby=cust_id]
years_active[rev]

These attempts to get the minimum index fail:

rev[, apply(.SD, 1, function(x) min(which(x>0))), .SDcols=series_col_nms, by=cust_id] # returns data type error    
rev[, do.call(pmin, lapply(.SD, function(x) which(x>0))), .SDcols=series_col_nms, by=cust_id] # returns empty

What is the data.table way to do this?


Solution

  • When you want to operate on many columns by row, the common practice it to melt your data set first and then operate on a single column.

    In your case, a relatively straightforward solution could be something like

    res <- melt(rev, id = 1:2)[, 
             as.list({
                temp <- value != 0
                if (any(temp)) range(which(temp)) else rep(NA_integer_, 2)
             }), 
            by = cust_id]
    
    rev[, c("Min", "Max") := res[, .(V1, V2)]]
    rev
    #    cust_id other_stuff  YearNo1 YearNo2  YearNo3  YearNo4  YearNo5  YearNo6  YearNo7 Min Max
    # 1: CustNo1           c 1640.311       0    0.000 1759.671    0.000 1503.933    0.000   1   6
    # 2: CustNo2           q 1009.496       0    0.000 1201.248    0.000    0.000 1308.095   1   7
    # 3: CustNo3           p    0.000       0    0.000    0.000 1484.991    0.000    0.000   5   5
    # 4: CustNo4           q 1666.084       0 1831.345 1992.150 1243.929    0.000 1051.647   1   7
    # 5: CustNo5           w    0.000       0    0.000    0.000    0.000    0.000    0.000  NA  NA
    

    A cleaner version but with warnings could be

    melt(rev, id = 1:2)[, as.list(as.integer(range(which(value != 0)))), by = cust_id]