Search code examples
rdata.tableapplylapplystringr

Extract and append unknown number of data.table columns under one new column name in R


I have an R data.table created from splitting a string into paragraphs (using stringr) and then transposing, e.g. from something like:

paras = dat[,transpose(str_extract_all(longstring, myregex))][, index := .I]

where longstring is a multi-paragraph string and regex some regular expression used to split longstring into paragraphs (index is just a column I need for various joins later).

This creates a new data.table paras which has index and as many new columns (V1,V2,V3...) as there are paragraphs in the longest observation of longstring (in my case it goes to V5). Some observations of longstring have fewer paragraphs and then will have NA values in e.g. V5.

I want a new data.table which has only three columns: index, V, where V contains all of the non-NA V1, V2, V3... columns appended together, and Vsource which shows the "paragraph number".

This is easily done with:

newdat = rbind(
      paras[,c('V', 'Vsource') := .(V1, 1)][!is.na(V),.(V,index, Vsource)],
      paras[,c('V', 'Vsource') := .(V2, 2)][!is.na(V),.(V,index, Vsource)],
      paras[,c('V', 'Vsource') := .(V3, 3)][!is.na(V),.(V,index, Vsource)],
      paras[,c('V', 'Vsource') := .(V4, 4)][!is.na(V),.(V,index, Vsource)],
      paras[,c('V', 'Vsource') := .(V5, 5)][!is.na(V),.(V,index, Vsource)]
      )

But the above is a really ugly way of doing this, and also requires me to hardcode the number of V variables. I would like a version which will work however many paragraphs are in longstring. I'm an R beginner and in my usual software (Stata) I could easily do this with a loop. But I can't get a loop to work for this problem in R, and anyway suspect there's some better vectorized way of doing, using something like lapply with .SD? But I'm still learning these methods.

PS: any good tutorials on using lapply, sapply, apply etc with data.table would be much appreciated.


Solution

  • I think this is just a melt. I'll demo with two simple sentences and split on " ", I think it should project on your real data.

    library(data.table)
    DT <- data.table(longstring=c("this is a long sentence", "not as long"))
    myregex <- " "
    paras <- DT[, transpose(strsplit(longstring, myregex))][, index := .I]
    paras
    #        V1     V2     V3     V4       V5 index
    #    <char> <char> <char> <char>   <char> <int>
    # 1:   this     is      a   long sentence     1
    # 2:    not     as   long   <NA>     <NA>     2
    
    yourdat <- rbind(
      copy(paras)[,c('V', 'Vsource') := .(V1, 1)][!is.na(V),.(V,index, Vsource)],
      copy(paras)[,c('V', 'Vsource') := .(V2, 2)][!is.na(V),.(V,index, Vsource)],
      copy(paras)[,c('V', 'Vsource') := .(V3, 3)][!is.na(V),.(V,index, Vsource)],
      copy(paras)[,c('V', 'Vsource') := .(V4, 4)][!is.na(V),.(V,index, Vsource)],
      copy(paras)[,c('V', 'Vsource') := .(V5, 5)][!is.na(V),.(V,index, Vsource)]
    )
    yourdat
    #           V index Vsource
    #      <char> <int>   <num>
    # 1:     this     1       1
    # 2:      not     2       1
    # 3:       is     1       2
    # 4:       as     2       2
    # 5:        a     1       3
    # 6:     long     2       3
    # 7:     long     1       4
    # 8: sentence     1       5
    

    (Note that I used copy(paras) here ... without it, the original paras had V and Vsource columns appended, which would disrupt the melt operation below.)

    And the melt:

    melt(paras, "index", variable.name = "Vsource", value.name = "V", na.rm  =TRUE)
    melted
    #    index Vsource        V
    #    <int>  <fctr>   <char>
    # 1:     1      V1     this
    # 2:     2      V1      not
    # 3:     1      V2       is
    # 4:     2      V2       as
    # 5:     1      V3        a
    # 6:     2      V3     long
    # 7:     1      V4     long
    # 8:     1      V5 sentence
    

    You just need to string-ify and remove the leading "V" to get the same actual values.

    melted[, Vsource := as.integer(sub("^V", "", as.character(Vsource)))]
    melted
    #    index Vsource        V
    #    <int>   <int>   <char>
    # 1:     1       1     this
    # 2:     2       1      not
    # 3:     1       2       is
    # 4:     2       2       as
    # 5:     1       3        a
    # 6:     2       3     long
    # 7:     1       4     long
    # 8:     1       5 sentence
    

    (Assuming you really need an integer.)