Unexpected clustering errors (partitioning around medoids)


I am using the fpc package for determining the optimal number of clusters. The pamk() function takes a dissimilarity matrix as an argument and does not require the user to specify k. According to the documentation:

pamk() This calls pam and clara for the partitioning around medoids clustering method (Kaufman and Rousseeuw, 1990) and includes two different ways of estimating the number of clusters.

But when I input two very similar matrices, foo and bar (data below), the function errors out on the second matrix (bar):

Error in pam(sdata, k, diss = diss, ...) : 
  Number of clusters 'k' must be in {1,2, .., n-1}; hence n >= 2 

What could be causing this error, given that the input matrices are basically the same? For example:

foo works!

hc <- hclust(as.dist(foo))
plot(hc)
pamk.best <- fpc::pamk(foo)
pamk.best$nc
[1] 2

bar does not

hc <- hclust(as.dist(bar))
plot(hc, main = 'bar dendrogram')
pamk.best <- fpc::pamk(bar)
Error in pam(sdata, k, diss = diss, ...) : 
  Number of clusters 'k' must be in {1,2, .., n-1}; hence n >= 2

Any suggestions would be helpful!

dput(foo)
structure(c(0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0, 0, 0, 
0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 
0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0, 0, 0, 
0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 
0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 9, 9, 9, 9, 
9, 9, 9, 9, 0, 9, 9, 9, 9, 9, 0, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 
0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0, 0, 0, 
0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 
0, 0, 0, 9, 0, 0, 0, 0, 0, 0, 0, 9, 0, 0, 0, 0, 0, 9, 9, 9, 9, 
9, 9, 9, 9, 0, 9, 9, 9, 9, 9, 0), .Dim = c(14L, 14L), .Dimnames = list(
    c("etc", "etc", "etc", "etc", "etc", "etc", "etc", "similares", 
    "etc", "etc", "etc", "etc", "etc", "similares"), NULL))

dput(bar)
structure(c(0, 6, 6, 6, 6, 6, 0, 0, 0, 0, 6, 0, 0, 0, 0, 6, 0, 
0, 0, 0, 6, 0, 0, 0, 0), .Dim = c(5L, 5L), .Dimnames = list(c("ramírez", 
"similares", "similares", "similares", "similares"), NULL))

Solution

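  • pam() requires the number of clusters k to be in 1..n-1, where n is the number of observations. bar has n = 5 rows, so max(krange) can be at most 4. The default krange in pamk() is 2:10, so pamk() eventually calls pam() with k = 5 and that call raises the error. foo has n = 14, so the default range is valid there. Pass a krange that fits the data; try:

    pamk.best <- fpc::pamk(bar, krange = 2:(nrow(bar) - 1))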
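
If the same code has to handle inputs of different sizes, the upper bound can be computed rather than hard-coded. The following is a minimal sketch using base R only; the helper name safe_krange and its kmax argument are made up for illustration and are not part of fpc or cluster.

```r
# bar as given in the question (5 x 5)
bar <- structure(c(0, 6, 6, 6, 6, 6, 0, 0, 0, 0, 6, 0, 0, 0, 0, 6, 0,
0, 0, 0, 6, 0, 0, 0, 0), .Dim = c(5L, 5L), .Dimnames = list(c("ramírez",
"similares", "similares", "similares", "similares"), NULL))

# Hypothetical helper: build a range of k values that pam() will accept.
# pam() needs k in 1..n-1, so the range is capped at n - 1.
# kmax = 10 mirrors the upper end of the default krange mentioned above.
safe_krange <- function(x, kmax = 10) {
  n <- nrow(as.matrix(x))  # as.matrix() also handles "dist" objects
  if (n < 3) {
    stop("need at least 3 observations to try k >= 2")
  }
  2:min(kmax, n - 1)
}

print(safe_krange(bar))              # range for the 5 x 5 input
print(safe_krange(diag(14)))         # a 14-row input keeps the full 2:10
```

The result could then be passed on as fpc::pamk(bar, krange = safe_krange(bar)). Capping at kmax keeps the behaviour for larger inputs such as foo the same as with the default range.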