Search code examples
rsumdata.table

r data.table - use string to create a summarized column from n columns


I have som data in a data.table that I want to to sum. Trivial example supplied below. I start with (variable) n-columns (char) and some group variables.

I can convert them (nsk-vars in my example) to numeric/integer beforehand, if that's the way to go. I would like to be able to use a string of columnames to create a new column which is the sum of these(nsk1-nsk3) and handle NA:s. How do I do this in a resonable way?

library(data.table)

x <- data.table(a    = c("GrpA", "GrpB"),
                nsk1 = c("1","3"),
                nsk2 = c(NA,"1"),
                nsk3 = c("3", "4"))

ClNamesStr <- colnames(x)
ClNamesStr <- ClNamesStr[grepl("^nsk", ClNamesStr)]

nskStrSum <- paste(ClNamesStr, collapse = "),as.numeric(")
nskStrSum <- paste("as.numeric(", nskStrSum, "), na.rm = TRUE")

This is the part that I would like to get working.

x[, nsk0 := sum(eval(nskStrSum)), by = a]

The desired output would be this:

a    nsk1 nsk2 nsk3 nsk0
GrpA    1   NA    3    4
GrpB    3    1    4    8

Solution

  • Here is one way to go.

    x[, nsk0 := rowSums(sapply(.SD, function(i) as.numeric(i)), na.rm=TRUE),
      .SDcols=grep("^nsk", names(x)), by = a]
    

    which returns

    x
          a nsk1 nsk2 nsk3 othr nsk0
    1: GrpA    1   NA    3    a    4
    2: GrpA    3    1    4    b    8
    

    sapply runs through the selected variables, converts each to a numeric and returns a matrix. This matrix is passed to rowSums which sums up the values in each row and ignores the NAs. .SD is a shorthand to select of the data.table, (except variables contained in "by"). However, the variables are selected among them using .SDcols along with grep.

    Some data.tablers might not be happy with this solution as it converts some of the data into a matrix. However, this conversion may be necessary in your case.

    I added an additional "nuisance" variable to show that it will work.

    data

    x <- data.table(a    = c("GrpA", "GrpA"),
                    nsk1 = c("1","3"),
                    nsk2 = c(NA,"1"),
                    nsk3 = c("3", "4"), 
                    othr = letters[1:2])