I am quite new to sequence analysis and am trying to identify clusters in an aggregated sequence matrix, focusing on the state duration. However, when using method='CHI2' or method='EUCLID' combined with step=1 (but not with other step values), I get the error:
Error in if (SCres > currentSCres) { : missing value where TRUE/FALSE needed
Any ideas why? There are some NaN values in the distance matrix; could they result from the sequences being of different lengths?
This is what the sequence object and the distance matrix look like:
Sequence
1 a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
2 a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
3 a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c
4 a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e
5 b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a
Distance matrix
          1         2         3         4
2       NaN
3 289.92897       NaN
4 141.07472       NaN 263.22855
5  10.22425       NaN 290.10919 141.44473
Code:
library(TraMineR) #version 2.0-13
library(WeightedCluster) #version 1.4
SO <- seqdef(DAT, right = 'DEL')
DM <- seqdist(SO, method = "CHI2", step = 1, full.matrix = FALSE)
FIT <- seqpropclust(SO, diss = DM, maxcluster = 8,
                    properties = c("state", "duration", "spell.age", "spell.dur",
                                   "transition", "pattern", "AFtransition", "AFpattern", "Complexity"))
The "CHI2" distance between two sequences x and y computed by TraMineR is the sum of the Chi-squared distances between the state distributions over the successive periods of length step. See Studer and Ritschard (2014, p. 8).
This means that for step=1 a Chi-squared distance is computed at each position. When one of the sequences has void values at some positions (e.g. the last position in your second sequence), the distance cannot be computed for these positions, and we get a NaN value for the CHI2 distance between this sequence and any other sequence.
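To make the mechanism concrete, here is a minimal base R sketch of a position-wise (step=1) Chi-squared distance. It is an illustrative re-implementation, not TraMineR's code: the helper names `state_dist` and `chi2_step1` are hypothetical, voids are represented as NA, and taking the square root of the summed per-position terms is an assumption. Under those assumptions, two complete sequences that differ only in their first state get a finite distance, while a void position yields a 0/0 state distribution and therefore NaN.

```r
# Toy data: 5 sequences of length 2 (rows = sequences, columns = positions).
seqs <- rbind(c("a", "a"), c("a", "a"), c("a", "e"), c("a", "e"), c("b", "a"))
states <- sort(unique(as.vector(seqs)))

# Proportion of each state among the non-void (non-NA) entries of x.
state_dist <- function(x, states) {
  counts <- vapply(states, function(s) sum(x == s, na.rm = TRUE), numeric(1))
  counts / sum(!is.na(x))  # 0/0 = NaN when x contains only voids
}

# Hypothetical helper: Chi-squared distance between sequences i and j,
# computed position by position (i.e. periods of length 1).
chi2_step1 <- function(seqs, i, j, states) {
  total <- 0
  for (t in seq_len(ncol(seqs))) {
    p_all <- state_dist(seqs[, t], states)   # overall distribution at position t
    p_i <- state_dist(seqs[i, t], states)    # distribution of sequence i
    p_j <- state_dist(seqs[j, t], states)    # distribution of sequence j
    keep <- p_all > 0                        # skip states absent at position t
    total <- total + sum((p_i[keep] - p_j[keep])^2 / p_all[keep])
  }
  sqrt(total)
}

d_complete <- chi2_step1(seqs, 1, 5, states)

# Make the last position of sequence 2 void and recompute.
seqs_void <- seqs
seqs_void[2, 2] <- NA
d_void <- chi2_step1(seqs_void, 1, 2, states)

print(d_complete)  # finite
print(d_void)      # NaN
```

The point of the sketch is only that a period with no valid element has no state distribution, so every distance involving that sequence becomes NaN.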
To avoid that, you can use the following workarounds:
1) Set a step value large enough to be sure each sequence contains at least one non-void element in each period. For your example, the longest sequences are of length 25. To be sure the last period contains non-void elements, you can set step=5, so that the last period covers positions 21 to 25.
DAT <- c("a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
"a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a",
"a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-c-c-c",
"a-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e-e",
"b-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a-a")
SO <- seqdef(DAT)
DM <- seqdist(SO, method = "CHI2", step=5)
DM
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.000000 0.000000 4.543441 4.543441 1.030776
## [2,] 0.000000 0.000000 4.543441 4.543441 1.030776
## [3,] 4.543441 4.543441 0.000000 2.028370 4.604927
## [4,] 4.543441 4.543441 2.028370 0.000000 4.604927
## [5,] 1.030776 1.030776 4.604927 4.604927 0.000000
2) Drop the columns with void elements:
SOdrop <- SO[,1:(ncol(SO)-1)]
SOdrop
DMd <- seqdist(SOdrop, method = "CHI2", step=1)
DMd
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.00000 0.00000 10.041580 10.041580 2.50000
## [2,] 0.00000 0.00000 10.041580 10.041580 2.50000
## [3,] 10.04158 10.04158 0.000000 4.472136 10.34811
## [4,] 10.04158 10.04158 4.472136 0.000000 10.34811
## [5,] 2.50000 2.50000 10.348108 10.348108 0.00000
3) Fill the shorter sequences with missing values and treat the missing value as an additional possible state. By default, right='DEL' in seqdef, which creates voids. Here we set right=NA to get missing values instead.
SOm <- seqdef(DAT, right=NA)
DMm <- seqdist(SOm, method = "CHI2", step=1, with.missing=TRUE)
DMm
## [,1] [,2] [,3] [,4] [,5]
## [1,] 0.000000 2.738613 10.408330 10.408330 2.500000
## [2,] 2.738613 0.000000 10.527741 10.527741 3.708099
## [3,] 10.408330 10.527741 0.000000 5.477226 10.704360
## [4,] 10.408330 10.527741 5.477226 0.000000 10.704360
## [5,] 2.500000 3.708099 10.704360 10.704360 0.000000
Now, the error reported in the question is NOT an error of seqdist, but of the seqpropclust function from the WeightedCluster package. The error is caused by the NaN values in the dissimilarity matrix.
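A simple guard before clustering is to check the dissimilarity matrix for NaN and find which sequence produces them. The sketch below uses base R only and rebuilds the matrix from the values printed in the question, with the diagonal set to 0 as an assumption; the name `culprits` is just illustrative.

```r
# Lower-triangle values copied from the question, filled column by column.
DM <- matrix(0, nrow = 5, ncol = 5)
DM[lower.tri(DM)] <- c(NaN, 289.92897, 141.07472, 10.22425,
                       NaN, NaN, NaN,
                       263.22855, 290.10919,
                       141.44473)
DM <- DM + t(DM)  # symmetric full matrix; the diagonal stays 0

# Does the matrix contain values that a clustering routine cannot compare?
has_nan <- anyNA(DM)

# A sequence is a likely culprit when all of its off-diagonal distances are NaN.
culprits <- which(rowSums(is.nan(DM)) == nrow(DM) - 1)

print(has_nan)
print(culprits)
```

Here the check points to the second sequence, the one that is shorter than the others, so one of the workarounds above should be applied before passing the matrix on.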