Search code examples
rdata.tablereshapereshape2

Splitting Column of a data.table


Wonder how to split column of a data.table.

Can split column of a data.frame easily with the following code:

df <- data.frame(Test=c("A - B", "C - D"))
df
     Test
1 A - B
2 C - D
library(reshape2)
reshape2:::colsplit(string = df[,1], pattern = " ", names = c("Var1", "Space", "Var2"))

  Var1 Space Var2
1    A     -    B
2    C     -    D

But my attempt of splitting column of a data.table fails

library(data.table)
dt <- data.table(Test=c("A - B", "C - D"))
dt
    Test
1: A - B
2: C - D
reshape2:::colsplit(string = dt[,1, with=FALSE], pattern = " ", names = c("Var1", "Space", "Var2"))
Error: String must be an atomic vector

Solution

  • I see that you're specifically asking for something with colsplit, but I would suggest looking at some of the other alternatives out there, for example my cSplit function.

    The cSplit approach would be as follows:

    setnames(cSplit(dt, "Test",  " "), c("Var1", "Space", "Var2"))[]
    #    Var1 Space Var2
    # 1:    A     -    B
    # 2:    C     -    D
    

    The [] at the end is to print the result, but you can also just store the result as a new data.table instead.


    How does it compare in terms of efficiency?

    fun1 <- function() {
      reshape2:::colsplit(string = dt[[1]], pattern = " ", 
                          names = c("Var1", "Space", "Var2"))
    } 
    fun2 <- function() {
      setnames(cSplit(dt, "Test",  " "), 
               c("Var1", "Space", "Var2"))[]
    }
    
    dt <- rbindlist(replicate(5000, dt, FALSE))
    dim(dt)
    # [1] 10000     1
    
    library(microbenchmark)
    microbenchmark(fun1(), fun2(), times = 10)
    # Unit: milliseconds
    #    expr        min         lq     median         uq        max neval
    #  fun1() 2025.84703 2093.39687 2195.75822 2390.30666 2492.65946    10
    #  fun2()   34.08966   36.01145   43.28036   47.45962   57.57615    10
    

    Why didn't your colsplit approach work as you expected?

    dt[,1] is more like df[,1, drop = FALSE] (try it--you'll get the same error that you got with your "data.table" attempt).

    You will need either of the following:

    > dt[[1]]
    [1] "A - B" "C - D"
    > dt$Test
    [1] "A - B" "C - D"
    

    Which is similar to what you get with:

    > df[, 1]
    [1] A - B C - D
    Levels: A - B C - D