Search code examples
rdata.tablerowwise

Row operations on selected columns based on substring in data.table


I would like to apply a function to selected columns that match two different substrings. I've found this post related to my question but I couldn't get an answer from there.

Here is a reproducible example with my failed attempt. For the sake of this example, I want to do a row-wise operation where I sum the values from all columns starting with string v and subtract from the average of the values in all columns starting with f.

update: the proposed solution must (a) use the := operator to make the most of data.table fast performance, and (2) be flexible to other operation rather than mean and sum, which I used here just for the sake of simplicity

library(data.table)

# generate data
  dt <- data.table(id= letters[1:5],
                   v1= 1:5, 
                   v2= 1:5,
                   f1= 11:15,
                   f2= 11:15)

dt
#>    id v1 v2 f1 f2
#> 1:  a  1  1 11 11
#> 2:  b  2  2 12 12
#> 3:  c  3  3 13 13
#> 4:  d  4  4 14 14
#> 5:  e  5  5 15 15

# what I've tried
  dt[, Y := sum( .SDcols=names(dt) %like% "v" ) - mean( .SDcols=names(dt) %like% "f" ) by = id]

Solution

  • We melt the dataset into 'long' format, by making use of the measure argument, get the difference between the sum of 'v' and mean of 'f', grouped by 'id', join on the 'id' column with the original dataset and assign (:=) the 'V1' as the 'Y' variable

    dt[melt(dt, measure = patterns("^v", "^f"), value.name = c("v", "f"))[
             , sum(v) - mean(f), id], Y :=V1,  on = .(id)]
    
    dt
    #   id v1 v2 f1 f2  Y
    #1:  a  1  1 11 11 -9
    #2:  b  2  2 12 12 -8
    #3:  c  3  3 13 13 -7
    #4:  d  4  4 14 14 -6
    #5:  e  5  5 15 15 -5
    

    Or another option is with Reduce after creating index or 'v' and 'f' columns

    nmv <- which(startsWith(names(dt), "v"))
    nmf <- which(startsWith(names(dt), "f"))
    l1 <- length(nmv)
    dt[, Y := Reduce(`+`, .SD[, nmv, with = FALSE])- (Reduce(`+`, .SD[, nmf, with  = FALSE])/l1)]