I have a mixed type data set, so I wanted to try kamila
clustering. It is easy to apply it, but I would like a plot to decide the number of clusters similar to knee-plot.
data <- read.csv("binarymat.csv",header=FALSE,sep=";")
conInd <- c(9)
conVars <- data[,conInd]
conVars <- data.frame(scale(conVars))
catVarsFac <- data[,c(1,2,3,4,5,6,7,8)]
catVarsFac[] <- lapply(catVarsFac, factor)
catVarsDum <- dummyCodeFactorDf(catVarsFac)
kamRes <- kamila(conVars, catVarsFac, numClust=5, numInit=10,
calcNumClust = "ps",numPredStrCvRun = 10, predStrThresh = 0.5)
summary(kamRes)
It says that the best number of clusters is 5. How does it decide that and can I see a plot indicating this?
In the kamila
package documentation
Setting calcNumClust to ’ps’ uses the prediction strength method of Tibshirani & Walther (J. of Comp. and Graphical Stats. 14(3), 2005). There is no perfect method for estimating the number of clusters; PS tends to give a smaller number than, say, BIC based methods for large sample sizes.
In the case, you are using it, you have specified only one value to numClust
. So, it doesn't look like you are actually selecting the number of clusters - you have already picked one.
To select the number of clusters, you have to specify the range you are interested in, for example, numClust = 2 : 7
and also the method for selecting the number of clusters.
If you also want to select the number of clusters, something like the following might work.
kamRes <- kamila(conVars, catVarsFac, numClust = 2 : 7, numInit = 10,
calcNumClust = "ps", numPredStrCvRun = 10, predStrThresh = 0.5)
Information on the selection of the number of clusters is now present in
kamRes$nClust
, and plot(2:7, kamRes$nClust$psValues)
could be what you are after.