Search code examples
rlistrowsum

Select columns using a vector


I am trying to create a new data frame with 2 columns: var1 and var2, each one of them is the row sum of specific columns in data frame sampData.

library(dplyr)

sampData <-
  rnorm(260) %>%
  matrix(ncol = 26) %>%
  data.frame() %>%
  setNames(LETTERS)

var1 <- c("A", "B", "C")
var2 <- c("D", "E", "F", "G")

I know that I can select columns using [] and c(), like this:

sampData[ ,c("A","B")]

but when I try to generate and use that format from my vectors like this:

d1_ <-paste(var1, collapse=",")
d2_ <-paste(var2, collapse=",")

sampData[ ,d1_]

I get this error:

Error in `[.data.frame`(sampData, , d1_) : undefined columns selected

Which I also get if I try to calculate the rowSums -- which is what I am interested in getting.

data.frame(var1 = rowSums(sampData[ , d1_])
           , var2 = rowSums(sampData[ , d2_])

Solution

  • I think I have managed to figure out what you are asking, but if I am wrong, let me know.

    You are trying to select columns from prep that match the values in l1 and l2, and sum across the rows, limited to the columns that matched each.

    It is always better to provide reproducible data, here is some for this case (using dplyr to build it):

    sampData <-
      rnorm(260) %>%
      matrix(ncol = 26) %>%
      data.frame() %>%
      setNames(LETTERS)
    
    var1 <- c("A", "B", "C")
    var2 <- c("D", "E", "F", "G")
    

    Then, you don't need to concatenate the column indices at all -- just use the variable (or column, in your case) directly. Here, I have made the ID's letters and will match the letters. However, if your ID's are numeric, it will match that index (e.g., 3 will return the third column).

    data.frame(
      var1sums = rowSums(sampData[, var1])
      , var2sums = rowSums(sampData[, var2])
    )
    

    Of note, cat returns NULL after printing to the screen. If you need to concatenate values, you will need to use paste (or similar), but that will not work for what you are trying to do here.

    This question got me thinking about flexibility of such solutions, so here is an attempt using dplyr and tidyr, which yields effectively the same result. The difference is that this may provide more flexibility for variable selection or even downstream processing.

    sampData %>%
      # add column for individual
      mutate(ind = 1:nrow(.)) %>%
      # convert data to long format
      gather("Variable", "Value", -ind) %>%
      # Set to group by the individual we added above
      group_by(ind) %>%
      # Calculate sums as desired
      summarise(
        var1sums = sum(Value[Variable %in% var1])
        , var2sums = sum(Value[Variable %in% var2])
      )
    

    However, the real advantage would come if you had an arbitrary number (or just a large number generally) of sets of variables that you wanted to get the individual sums from. Instead of manually constructing every column you might be interested in, you can use standard evaluation (as opposed to non-standard) to automatically generate the columns based on a named list of vectors:

    sampData %>%
      mutate(ind = 1:nrow(.)) %>%
      gather("Variable", "Value", -ind) %>%
      group_by(ind) %>%
      # Calculate one column for each vector in `varList`
      summarise_(
        .dots = lapply(varList, function(x){
          paste0("sum(Value[Variable %in% c('"
                 , paste(x, collapse = "', '")
                 , "')])")
        })
      )