Search code examples
rdata.tablestrsplit

Split a column, get the mean of the split columns, and update the result


I am currently trying to separate a column that contains a range of value (as character) into two numeric columns to compute the mean of them if a row has a range. Then, I want to replace the old column with the updated result one. Actually, there are multiple columns to be done.. I tried to find a way, but it seems challenging to me.

Below is the code that I have tried.. which does not work..

test.val <- data.table(id = c(1, 2, 3), 
                       colA = c("100-150", "200", "300"), 
                       colB = c("15", "20-30", "10"))
test.A <- test.val[, lapply(.SD, function(x){strsplit(x, split = "-")}), .SDcols = c("colA", "colB")]
test.B[, lapply(.SD, mean), .SDcols = c("colA", "colB")]

At the end, I'd like to have the following:

   id colA colB
1:  1  125   15
2:  2  200   25
3:  3  300   10

Anyone that can help me out? Many thanks.


Solution

  • Another option using data.table

    library(data.table)
    cols <- c("colA", "colB")
    for(j in cols) {
      tmp <- vapply(strsplit(test.val[[j]], "-"), 
                    FUN = function(i) mean(as.numeric(i)), 
                    FUN.VALUE = numeric(1))
      set(test.val, j = j, value = tmp)
    }
    test.val
    #   id colA colB
    #1:  1  125   15
    #2:  2  200   25
    #3:  3  300   10
    

    Given a vector

    x <- c("100-150", "200", "300")
    

    the result of strsplit is a list of character vectors

    strsplit(x, "-")
    #[[1]]
    #[1] "100" "150"
    
    #[[2]]
    #[1] "200"
    
    #[[3]]
    #[1] "300"
    

    We wrap this into vapply and calculate the mean for each element after we converted each vector to numeric.

    vapply(strsplit(x, "-"), function(x) mean(as.numeric(x)), numeric(1))
    # [1] 125 200 300
    

    We use this result to replace every column specified in cols using data.table's set function.