stri_split_fixed in a data.table in R

I have a data.table DT as follows.

DT <- structure(list(V1 = structure(1:3, .Label = c("S01", "S02", "S03" ), class = "factor"), V2 = structure(c(1L, 3L, 2L), .Label = c("Alan Hal << Guy John", "Bruce Dick Jean-Paul << Damien", "Jay << Barry Wally Bart"), class = "factor")), .Names = c("V1", "V2"), row.names = c(NA, -3L), class = "data.frame")
# DT
#    V1                             V2
# 1 S01           Alan Hal << Guy John
# 2 S02        Jay << Barry Wally Bart
# 3 S03 Bruce Dick Jean-Paul << Damien

I am trying to split the column V2 at "<<" and get the output in two new columns.

I could get it done as follows using stringi

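library(data.table)
library(stringi)
# rbind the per-row pieces into a 3 x 2 matrix, then convert to a data.table
T <- data.table(do.call(rbind, stri_split_fixed(DT$V2, "<<", 2)))
setnames(T, old = colnames(T), new = c("V3", "V4"))
cbind(DT, T)
    V1                             V2                    V3                V4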
1: S01           Alan Hal << Guy John             Alan Hal           Guy John
2: S02        Jay << Barry Wally Bart                  Jay   Barry Wally Bart
3: S03 Bruce Dick Jean-Paul << Damien Bruce Dick Jean-Paul             Damien

However, I would like to do the same by reference using the := operator. How can this be done in data.table?

I am having difficulty with the RHS part.

DT[, c("V3", "V4") := list()]

stri_split_fixed(DT$V2, "<<", 2) gives a list of 3 with character vectors of length 2. How do I get a list of 2 with character vectors of length 3?
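To make the shape problem concrete, here is a minimal base-R sketch. It uses strsplit with fixed = TRUE as a stand-in for stri_split_fixed (an assumption, so that the snippet needs no packages), and the names v2, by_row and by_col are made up for illustration.

```r
# Stand-in input: the same three strings as DT$V2
v2 <- c("Alan Hal << Guy John",
        "Jay << Barry Wally Bart",
        "Bruce Dick Jean-Paul << Damien")

# One list element per row, each a character vector of length 2
by_row <- strsplit(v2, " << ", fixed = TRUE)

# "Transpose" the list: one element per new column, each of length 3
by_col <- lapply(1:2, function(i) vapply(by_row, `[`, character(1), i))

str(by_col)
```

A list shaped like by_col is what the right-hand side of := expects when two column names are given on the left.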


  • You could try

    setDT(DT)[, c('V3', 'V4') := do.call(rbind.data.frame,
                        stri_split_fixed(V2, ' << ', 2))][]
    #    V1                             V2                    V3                V4
    #1: S01           Alan Hal << Guy John             Alan Hal           Guy John
    #2: S02        Jay << Barry Wally Bart                  Jay   Barry Wally Bart
    #3: S03 Bruce Dick Jean-Paul << Damien Bruce Dick Jean-Paul             Damien

    Or you could use strsplit (from @David Arenburg's comments)

    setDT(DT)[, c('V3', 'V4') := do.call(rbind.data.frame,
                       strsplit(as.character(V2), " << "))]

    More efficient option (as suggested by @Ananda Mahto)

    cbind(DT, `colnames<-`(stri_split_fixed(DT$V2,
                  " << ", simplify = TRUE), c("V3", "V4")))

    Another option would be to use cSplit from splitstackshape

    cSplit(DT, 'V2', ' << ', stripWhite=FALSE, drop=FALSE)
    #    V1                             V2                 V2_1             V2_2
    #1: S01           Alan Hal << Guy John             Alan Hal         Guy John
    #2: S02        Jay << Barry Wally Bart                  Jay Barry Wally Bart
    #3: S03 Bruce Dick Jean-Paul << Damien Bruce Dick Jean-Paul           Damien

    A faster version of cSplit, with performance similar to stri_split, is available as a Gist.
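One more possibility, not among the options above, is data.table's own tstrsplit, which splits and transposes in one step so the result can go straight on the right-hand side of :=. The sketch below assumes a data.table version that ships tstrsplit (1.9.6 or later) and rebuilds DT with character columns purely for illustration.

```r
library(data.table)

# Illustrative copy of the question's data, with character columns
DT <- data.table(
  V1 = c("S01", "S02", "S03"),
  V2 = c("Alan Hal << Guy John",
         "Jay << Barry Wally Bart",
         "Bruce Dick Jean-Paul << Damien")
)

# tstrsplit returns one list element per piece (i.e. per new column);
# fixed = TRUE is passed on to strsplit so " << " is matched literally
DT[, c("V3", "V4") := tstrsplit(V2, " << ", fixed = TRUE)]

print(DT)
```

Because := assigns by reference, DT itself gains V3 and V4 and no cbind step is needed.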