I have an R data.table
created from splitting a string into paragraphs (using stringr
) and then transposing, e.g. from something like:
paras = dat[,transpose(str_extract_all(longstring, myregex))][, index := .I]
where longstring
is a multi-paragraph string and regex
some regular expression used to split longstring
into paragraphs (index
is just a column I need for various joins later).
This creates a new data.table paras
which has index
and as many new columns (V1
,V2
,V3
...) as there are paragraphs in the longest observation of longstring
(in my case it goes to V5
). Some observations of longstring
have fewer paragraphs and then will have NA
values in e.g. V5
.
I want a new data.table which has only three columns: index
, V
, where V
contains all of the non-NA V1
, V2
, V3
... columns appended together, and Vsource
which shows the "paragraph number".
This is easily done with:
newdat = rbind(
paras[,c('V', 'Vsource') := .(V1, 1)][!is.na(V),.(V,index, Vsource)],
paras[,c('V', 'Vsource') := .(V2, 2)][!is.na(V),.(V,index, Vsource)],
paras[,c('V', 'Vsource') := .(V3, 3)][!is.na(V),.(V,index, Vsource)],
paras[,c('V', 'Vsource') := .(V4, 4)][!is.na(V),.(V,index, Vsource)],
paras[,c('V', 'Vsource') := .(V5, 5)][!is.na(V),.(V,index, Vsource)]
)
But the above is a really ugly way of doing this, and also requires me to hardcode the number of V variables. I would like a version which will work however many paragraphs are in longstring
. I'm an R beginner and in my usual software (Stata) I could easily do this with a loop. But I can't get a loop to work for this problem in R, and anyway suspect there's some better vectorized way of doing, using something like lapply
with .SD
? But I'm still learning these methods.
PS: any good tutorials on using lapply
, sapply
, apply
etc with data.table would be much appreciated.
I think this is just a melt
. I'll demo with two simple sentences and split on " "
, I think it should project on your real data.
library(data.table)
DT <- data.table(longstring=c("this is a long sentence", "not as long"))
myregex <- " "
paras <- DT[, transpose(strsplit(longstring, myregex))][, index := .I]
paras
# V1 V2 V3 V4 V5 index
# <char> <char> <char> <char> <char> <int>
# 1: this is a long sentence 1
# 2: not as long <NA> <NA> 2
yourdat <- rbind(
copy(paras)[,c('V', 'Vsource') := .(V1, 1)][!is.na(V),.(V,index, Vsource)],
copy(paras)[,c('V', 'Vsource') := .(V2, 2)][!is.na(V),.(V,index, Vsource)],
copy(paras)[,c('V', 'Vsource') := .(V3, 3)][!is.na(V),.(V,index, Vsource)],
copy(paras)[,c('V', 'Vsource') := .(V4, 4)][!is.na(V),.(V,index, Vsource)],
copy(paras)[,c('V', 'Vsource') := .(V5, 5)][!is.na(V),.(V,index, Vsource)]
)
yourdat
# V index Vsource
# <char> <int> <num>
# 1: this 1 1
# 2: not 2 1
# 3: is 1 2
# 4: as 2 2
# 5: a 1 3
# 6: long 2 3
# 7: long 1 4
# 8: sentence 1 5
(Note that I used copy(paras)
here ... without it, the original paras
had V
and Vsource
columns appended, which would disrupt the melt
operation below.)
And the melt:
melt(paras, "index", variable.name = "Vsource", value.name = "V", na.rm =TRUE)
melted
# index Vsource V
# <int> <fctr> <char>
# 1: 1 V1 this
# 2: 2 V1 not
# 3: 1 V2 is
# 4: 2 V2 as
# 5: 1 V3 a
# 6: 2 V3 long
# 7: 1 V4 long
# 8: 1 V5 sentence
You just need to string-ify and remove the leading "V"
to get the same actual values.
melted[, Vsource := as.integer(sub("^V", "", as.character(Vsource)))]
melted
# index Vsource V
# <int> <int> <char>
# 1: 1 1 this
# 2: 2 1 not
# 3: 1 2 is
# 4: 2 2 as
# 5: 1 3 a
# 6: 2 3 long
# 7: 1 4 long
# 8: 1 5 sentence
(Assuming you really need an integer.)