r cluster-analysis traminer sequence-analysis

Sequence analysis clustering CHI2 EUCLID error

I am quite new to sequence analysis and am trying to identify clusters in an aggregated sequence matrix, focusing on state duration. However, when using method='CHI2' or method='EUCLID' with step=1 (but not with other step values), I get the following error:

Error in if (SCres > currentSCres) { : missing value where TRUE/FALSE needed

Any ideas why? There are some NaN values in the distance matrix; could they result from the sequences being of different lengths?

What the sequence object and distance matrix look like:

Sequence                                         
1    a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
2    a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a  
3    a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c
4    a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e
5    b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a

Distance matrix
          1    2         3         4
2       NaN
3 289.92897  NaN
4 141.07472  NaN 263.22855
5  10.22425  NaN 290.10919 141.44473

Code:

library(TraMineR) #version 2.0-13
library(WeightedCluster) #version 1.4

SO <- seqdef(DAT, right = "DEL")
DM <- seqdist(SO, method = "CHI2", step = 1, full.matrix = FALSE)
FIT <- seqpropclust(SO, diss = DM, maxcluster = 8,
      properties = c("state", "duration", "spell.age", "spell.dur",
        "transition", "pattern", "AFtransition", "AFpattern", "Complexity"))

Solution

The "CHI2" distance between two sequences x and y computed by TraMineR is the sum of the Chi-squared distances between the state distributions over the successive periods of length step. See Studer and Ritschard (2014, p. 8).

This means that for step=1 a Chi-squared distance is computed at each position. When one of the sequences has void values at some positions (e.g. the last position in your second sequence), the distance cannot be computed for these positions, and the CHI2 distance between this sequence and any other sequence is NaN.
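To make the mechanism concrete, here is a minimal base R sketch of what happens at position 25. It is a simplified illustration and not TraMineR's internal code: the helper state_distribution is hypothetical, voids are represented by NA, and a plain Euclidean-style distance stands in for the actual formula.

```r
# Simplified illustration only; this is not TraMineR's internal code.
states <- c("a", "b", "c", "e")

# Proportion of each state within one period of a single sequence.
# Void positions are represented here by NA and are not counted.
state_distribution <- function(period_values) {
  counts <- table(factor(period_values, levels = states))
  as.numeric(counts) / sum(counts)  # 0 / 0 yields NaN when the period is all void
}

last_pos_seq1 <- state_distribution("a")  # sequence 1 has "a" at position 25
last_pos_seq2 <- state_distribution(NA)   # sequence 2 is void at position 25

last_pos_seq1
last_pos_seq2

# Any distance built from these proportions is NaN as well
dist_last_pos <- sqrt(sum((last_pos_seq1 - last_pos_seq2)^2))
is.nan(dist_last_pos)
```

Because a sum containing a single NaN term is itself NaN, one void position is enough to make the whole distance undefined.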

To avoid that, you can use the following workarounds:

1) Set a step value large enough to be sure that each sequence contains at least one non-void element in each period. For your example, the longest sequences are of length 25. To be sure the last period contains non-void elements, you have to set step=5.

DAT <- c("a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
         "a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",  
         "a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c",
         "a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e",
         "b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a")
SO <- seqdef(DAT)
DM <- seqdist(SO, method = "CHI2", step=5)
DM
##          [,1]     [,2]     [,3]     [,4]     [,5]
## [1,] 0.000000 0.000000 4.543441 4.543441 1.030776
## [2,] 0.000000 0.000000 4.543441 4.543441 1.030776
## [3,] 4.543441 4.543441 0.000000 2.028370 4.604927
## [4,] 4.543441 4.543441 2.028370 0.000000 4.604927
## [5,] 1.030776 1.030776 4.604927 4.604927 0.000000

2) Drop the columns with void elements:

SOdrop <- SO[,1:(ncol(SO)-1)]
SOdrop
DMd <- seqdist(SOdrop, method = "CHI2", step=1)
DMd
##          [,1]     [,2]      [,3]      [,4]     [,5]
## [1,]  0.00000  0.00000 10.041580 10.041580  2.50000
## [2,]  0.00000  0.00000 10.041580 10.041580  2.50000
## [3,] 10.04158 10.04158  0.000000  4.472136 10.34811
## [4,] 10.04158 10.04158  4.472136  0.000000 10.34811
## [5,]  2.50000  2.50000 10.348108 10.348108  0.00000

3) Fill the shorter sequences with missing values and treat the missing value as an additional possible state. By default, right='DEL' in seqdef, which creates voids. Here we set right=NA to get missing values instead.

SOm <- seqdef(DAT, right = NA)
DMm <- seqdist(SOm, method = "CHI2", step = 1, with.missing = TRUE)
DMm
##          [,1]      [,2]      [,3]      [,4]      [,5]
## [1,]  0.000000  2.738613 10.408330 10.408330  2.500000
## [2,]  2.738613  0.000000 10.527741 10.527741  3.708099
## [3,] 10.408330 10.527741  0.000000  5.477226 10.704360
## [4,] 10.408330 10.527741  5.477226  0.000000 10.704360
## [5,]  2.500000  3.708099 10.704360 10.704360  0.000000
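For workaround 1, the choice of step can be checked with a small base R sketch. The helper valid_steps is hypothetical and not part of TraMineR. It assumes that step has to divide the maximum sequence length and that voids occur only at the right end of the sequences, as with right='DEL'.

```r
# Hypothetical helper (not part of TraMineR): list the step values for which
# every sequence has at least one non-void position in every period.
# Assumption: 'step' has to divide the maximum sequence length.
valid_steps <- function(seq_lengths) {
  max_len <- max(seq_lengths)
  candidates <- which(max_len %% seq_len(max_len) == 0)  # divisors of max_len
  # The last period starts at position max_len - step + 1, so the shortest
  # sequence must reach at least that position.
  candidates[min(seq_lengths) >= max_len - candidates + 1]
}

seq_lengths <- c(25, 24, 25, 25, 25)  # lengths of the five example sequences
valid_steps(seq_lengths)
```

Under these assumptions the smallest admissible value for the example is 5, which matches the step used above. With a real state sequence object, the lengths could probably be obtained with seqlength(SO).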

Note that the error reported in the question is not raised by seqdist, but by the seqpropclust function from the WeightedCluster package. It is caused by the NaN values in the dissimilarity matrix.
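A simple guard is to inspect the dissimilarities before clustering. The sketch below uses only base R and rebuilds the matrix from the values printed in the question. The variable names are illustrative, and the rule for spotting the culprit (a row that is NaN against every other sequence) is an assumption that fits this example.

```r
# Lower-triangle values copied from the question, column by column
d <- c(NaN, 289.92897, 141.07472, 10.22425,  # distances from sequence 1
       NaN, NaN, NaN,                        # distances from sequence 2
       263.22855, 290.10919,                 # distances from sequence 3
       141.44473)                            # distance from sequence 4

# Rebuild the full symmetric matrix with a zero diagonal
DM <- matrix(0, 5, 5)
DM[lower.tri(DM)] <- d
DM <- DM + t(DM)

# Sequences whose distance to every other sequence is NaN
culprits <- which(rowSums(is.nan(DM)) == nrow(DM) - 1)

if (anyNA(DM)) {
  message("Dissimilarities contain NaN; fix them before clustering. ",
          "Affected sequence(s): ", paste(culprits, collapse = ", "))
}
```

Here the check points to sequence 2, the one that is a position shorter than the others.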