As a follow-up to this oldie but goodie: efficient row-wise operations on a data.table
I have some data that (unfortunately) look like:
library('data.table')
set.seed(1234)
m <- 5
n <- 7
rb <- function() runif(m,1000,2000) * rbinom(m,1,0.5)
series_col_nms <- paste0('YearNo',1:n)
rev <- data.table(cust_id = paste0('CustNo',1:m),
other_stuff = sample(letters,m, replace=TRUE))
for(col in series_col_nms){
set(rev, j=col, value=rb())
}
setkey(rev, cust_id)
One row per customer with various columns including yearly revenue for year 1, 2, ...
I would like to get the year indices for the first and last years with any revenue from each customer.
I can produce the desired result, but with a somewhat hacky join:
years_active <- rev[, which(.SD>0), .SDcols = series_col_nms,
keyby=cust_id][, .(min_year_active = min(V1),
max_year_active = max(V1)), keyby=cust_id]
years_active[rev]
These attempts to get the minimum index fail:
rev[, apply(.SD, 1, function(x) min(which(x>0))), .SDcols=series_col_nms, by=cust_id] # returns data type error
rev[, do.call(pmin, lapply(.SD, function(x) which(x>0))), .SDcols=series_col_nms, by=cust_id] # returns empty
What is the data.table
way to do this?
When you want to operate on many columns by row, the common practice it to melt
your data set first and then operate on a single column.
In your case, a relatively straightforward solution could be something like
res <- melt(rev, id = 1:2)[,
as.list({
temp <- value != 0
if (any(temp)) range(which(temp)) else rep(NA_integer_, 2)
}),
by = cust_id]
rev[, c("Min", "Max") := res[, .(V1, V2)]]
rev
# cust_id other_stuff YearNo1 YearNo2 YearNo3 YearNo4 YearNo5 YearNo6 YearNo7 Min Max
# 1: CustNo1 c 1640.311 0 0.000 1759.671 0.000 1503.933 0.000 1 6
# 2: CustNo2 q 1009.496 0 0.000 1201.248 0.000 0.000 1308.095 1 7
# 3: CustNo3 p 0.000 0 0.000 0.000 1484.991 0.000 0.000 5 5
# 4: CustNo4 q 1666.084 0 1831.345 1992.150 1243.929 0.000 1051.647 1 7
# 5: CustNo5 w 0.000 0 0.000 0.000 0.000 0.000 0.000 NA NA
A cleaner version but with warnings could be
melt(rev, id = 1:2)[, as.list(as.integer(range(which(value != 0)))), by = cust_id]