How to decide best number of clusters for kamila clustering with R?


I have a mixed-type data set, so I wanted to try KAMILA clustering. It is easy to apply, but I would like a plot, similar to a knee plot, to help decide the number of clusters.

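library(kamila)  # provides kamila() and dummyCodeFactorDf()

data <- read.csv("binarymat.csv", header = FALSE, sep = ";")

# Column 9 is the continuous variable; standardize it and keep it as a data frame
conInd <- c(9)
conVars <- data.frame(scale(data[, conInd, drop = FALSE]))

# Columns 1-8 are categorical; convert each one to a factor
catVarsFac <- data[, 1:8]
catVarsFac[] <- lapply(catVarsFac, factor)
catVarsDum <- dummyCodeFactorDf(catVarsFac)  # dummy-coded copy; not used by kamila() below

kamRes <- kamila(conVars, catVarsFac, numClust = 5, numInit = 10,
                 calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
summary(kamRes)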

It says that the best number of clusters is 5. How does it decide that, and can I see a plot that shows this?


Solution

  • From the kamila package documentation:

    Setting calcNumClust to 'ps' uses the prediction strength method of Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005). There is no perfect method for estimating the number of clusters; PS tends to give a smaller number than, say, BIC based methods for large sample sizes.

    In your call, you passed only a single value to numClust. So it doesn't look like you are actually selecting the number of clusters: you have already fixed it at 5.

    To select the number of clusters, you have to specify the range of candidate values you are interested in, for example numClust = 2:7, along with the method for choosing among them.

    To have kamila make that choice, something like the following might work.

    kamRes <- kamila(conVars, catVarsFac, numClust = 2:7, numInit = 10,
                     calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)

    Information on the selection of the number of clusters is now stored in kamRes$nClust, and something like plot(2:7, kamRes$nClust$psValues) could be what you are after.
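To make that plot read more like a knee plot, here is a minimal sketch in base R. The helper plotPredStrength is hypothetical (it is not part of kamila), and the values in ps are made up for illustration rather than taken from a real run. It assumes the usual prediction strength rule of picking the largest number of clusters whose value is at or above the threshold; the choice kamila itself reports may differ from what this helper returns.

```r
# Hypothetical helper (not part of kamila): plot prediction strength
# against the candidate numbers of clusters and mark the threshold.
plotPredStrength <- function(ks, ps, thresh = 0.5) {
  stopifnot(length(ks) == length(ps))
  plot(ks, ps, type = "b", ylim = c(0, 1),
       xlab = "Number of clusters", ylab = "Prediction strength")
  abline(h = thresh, lty = 2)  # dashed horizontal line at the threshold
  ok <- ks[ps >= thresh]
  # Assumed rule: largest candidate at or above the threshold; NA if none qualifies
  if (length(ok) > 0) max(ok) else NA_integer_
}

# Illustrative, made-up values (NOT output from a real kamila run)
ks <- 2:7
ps <- c(0.92, 0.81, 0.66, 0.55, 0.41, 0.30)
bestK <- plotPredStrength(ks, ps, thresh = 0.5)
print(bestK)
```

With a fitted model, the call would presumably be plotPredStrength(2:7, kamRes$nClust$psValues, thresh = 0.5), assuming psValues holds one entry per candidate in the same order as numClust. It is worth comparing the result with the value stored in kamRes$nClust.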