
Splitting text column into ragged multiple new columns in a data table in R


I have a data table containing 20000+ rows and one column. The string in each row has a different number of words. I want to split the words and put each of them in a new column. I know how to do it one word at a time:

Data[, Word1 := as.character(lapply(strsplit(as.character(complaint), split = " "), "[", 1))]

(Data is my data table and complaint is the name of the column)

Obviously, this is not efficient: each row has a different number of words, so I would have to repeat the call once for every word position.

Could you please tell me about a more efficient way to do this?


Solution

  • Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).

    Assuming KFB's sample data is at least slightly representative of your actual data, you can try:

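    library(splitstackshape)
    # Sample data reconstructed from the output below (one text column `x`)
    df <- data.frame(x = c("This is interesting", "This actually is not"))
    cSplit(df, "x", " ")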
    #     x_1      x_2         x_3 x_4
    # 1: This       is interesting  NA
    # 2: This actually          is not
    

    Another (blazing-fast) option is stri_split_fixed with simplify = TRUE, from "stringi" (which will most likely make its way into the "splitstackshape" code soon):

    library(stringi)
    stri_split_fixed(df$x, " ", simplify = TRUE)
    #      [,1]   [,2]       [,3]          [,4] 
    # [1,] "This" "is"       "interesting" NA   
    # [2,] "This" "actually" "is"          "not"
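
To get from that matrix back to new columns in the original data table, something along these lines might work. This is only a sketch: `Data` and `complaint` are the names from the question, the two sample rows and the `Word1`, `Word2`, ... column names are made up for illustration. It uses `simplify = NA` rather than `TRUE`, since, as far as the stringi documentation describes it, recent versions pad short rows with empty strings for `TRUE` and with `NA` for `NA`.

```r
library(data.table)
library(stringi)

# Hypothetical stand-in for the asker's table: one text column named `complaint`
Data <- data.table(complaint = c("This is interesting", "This actually is not"))

# Split on single spaces into a character matrix, one row per input string;
# simplify = NA pads the shorter rows with NA
words <- stri_split_fixed(Data$complaint, " ", simplify = NA)

# Illustrative column names: Word1, Word2, ..., one per matrix column
new_cols <- paste0("Word", seq_len(ncol(words)))

# Add all new columns at once, by reference
Data[, (new_cols) := as.data.table(words)]
```

The number of new columns is taken from the widest row, so nothing has to be known about the longest complaint in advance.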