Formatting data for event sequencing in TraMineR

I would like to examine the relative turbulence of text within a series of text compositions, using the seqST() function of the TraMineR package. Within my data frame, each row (N=65) has a single column housing the full text of the composition. To calculate the turbulence of each composition, I believe I need to first (a) use the seqdef() function on my data to define a sequence object and then (b) input that sequence object into the turbulence function, seqST(). However, I'm not sure how to properly format my data for the first step. Most of the examples I can find are, sensibly, life course studies, in which the data are formatted as one column per sequence item.

Questions:

1) To create a sequence object, would I first need to format my data so that each column contains a single word of the composition (rather than the full composition)? If so, any suggestions on the easiest way to do so?

2) Is there any reason to believe this approach would not work with compositions that a) have variable lengths and/or b) exceed a particular length?

3) Text compositions, intuitively, can be more variable than most life cycle state values (i.e., vocabularies can be quite large). Does TraMineR have a cap on the number of possible state values it can reliably factor when deriving values for turbulence, entropy, etc.?

Thanks; any guidance is appreciated.

Solution

I illustrate below how to proceed using the first two sentences of each of the three texts of your example data. I assumed that sentences are separated by a period, but did not handle commas, so you may first have to eliminate the commas. In the code below I use tolower to ignore capitalization. We simply use the seqdecomp function of TraMineR to transform your text into table form (one word per column) and then input the table to seqdef.

text = c(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat",
  "Tristique nulla aliquet enim tortor at auctor urna nunc Magna fermentum iaculis eu non diam phasellus vestibulum",
  "Quam adipiscing vitae proin sagittis nisl rhoncus mattis rhoncus Facilisi morbi tempus iaculis urna id"
)

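library(TraMineR)

# Split each lower-cased text on spaces: one word per column
d.text <- seqdecomp(tolower(text), sep=" ")
# Build the state sequence object from the word table
s.text <- seqdef(d.text)

entr <- seqient(s.text)  # longitudinal entropy
cplx <- seqici(s.text)   # complexity index
turb <- seqST(s.text)    # turbulence

data.frame(entr,cplx,turb)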

##       Entropy         C Turbulence
## [1] 0.8528759 0.9235128   35.98833
## [2] 0.6919821 0.8318546   17.00000
## [3] 0.6388399 0.7992746   14.80735

Here, we have computed the longitudinal entropy, the complexity index, and the turbulence.
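Because the code above leaves commas attached to the words (so "amet," and "amet" would count as different states), it may help to clean the text before splitting it. The following is only a minimal base-R sketch of one way to do that; the sample vector raw_text and the name clean_text are made up for illustration, and it assumes you do not need the punctuation (periods included) once the text has been selected.

```r
# Hypothetical sample texts (not the data from the question)
raw_text <- c(
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit.",
  "Ut enim ad minim veniam, quis nostrud."
)

# Lower-case so that "Ut" and "ut" are treated as the same state
clean_text <- tolower(raw_text)
# Drop all punctuation (commas, periods, ...)
clean_text <- gsub("[[:punct:]]", "", clean_text)
# Collapse runs of whitespace into single spaces and trim both ends
clean_text <- gsub("[[:space:]]+", " ", clean_text)
clean_text <- trimws(clean_text)

print(clean_text)
```

The cleaned vector can then be passed to seqdecomp(clean_text, sep=" ") in the same way as tolower(text) in the code above.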

There is no known limitation on the size of the alphabet for the computation of the above indexes, except that a larger alphabet may increase computation time. Very large alphabets are mainly an issue for graphical representations of the sequences, because it is difficult to find enough contrasting colors.
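To get a feel for how large the alphabet and the sequences are before building the sequence object, you could count words with base R. This is just a sketch under assumptions: the vector texts is a made-up example that is already lower-cased and free of punctuation, and the variable names are illustrative.

```r
# Hypothetical cleaned texts (lower-case, punctuation already removed)
texts <- c(
  "ut enim ad minim veniam quis nostrud ut aliquip",
  "magna fermentum iaculis eu non diam",
  "quam adipiscing vitae proin sagittis rhoncus mattis rhoncus"
)

# One character vector of words per composition
tokens <- strsplit(texts, " ", fixed = TRUE)

# Number of words per composition (the sequence lengths may differ)
seq_lengths <- lengths(tokens)

# Number of distinct words over all compositions (the alphabet size)
alphabet_size <- length(unique(unlist(tokens)))

print(seq_lengths)
print(alphabet_size)
```

Once the sequence object exists, alphabet(s.text) and seqlength(s.text) from TraMineR should report the corresponding information directly.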

A known drawback of the turbulence is that, unlike the complexity index, it ignores states that are not present in the sequence. Moreover, computing the turbulence may be much more time-consuming. Therefore, we would recommend using the complexity index.