I am running a k-means algorithm in R and trying to find the optimal number of clusters, k. Using the silhouette method, the gap statistic, and the elbow method, I determined that the optimal number of clusters is 2. While there are no predefined clusters for the business, I am concerned that k=2 is not very insightful, which leads me to a few questions.
1) What does an optimal k = 2 mean in terms of the data's natural clustering? Does this suggest that maybe there are no clear clusters or that no clusters are better than any clusters?
2) At k = 2, the R-squared is low (0.1). At k = 5, the R-squared is much better (0.32). What exactly are the trade-offs of selecting k = 5, knowing it's not optimal? Would it be that you can increase the number of clusters, but they may not be distinct enough?
3) My n = 1000. I have 100 variables to choose from but selected only 5 based on domain knowledge. Would increasing the number of variables necessarily make the clustering better?
4) As a follow-up to question 3, if a variable is introduced and lowers the R-squared, what does that say about the variable?
I am no expert but I will try to answer as best as I can:
1) Your methods for choosing the number of clusters gave you k=2, which would suggest there is clear clustering; the number is just low (2). To help with this, try using your knowledge of the domain to guide the interpretation: do 2 clusters make sense given your domain?
2) Yes, you're correct. The optimal solution in terms of R-squared is to have as many clusters as data points; however, this isn't optimal in terms of why you're doing k-means. You're doing k-means to gain insight from the data; that is your primary goal. As such, if you choose k=5, your data will fit your 5 clusters better, but as you say, there probably isn't much distinction between them, so you're not gaining much insight.
3) Not necessarily; in fact, adding variables blindly could make it worse. K-means uses Euclidean distance, so every variable contributes to the distance (equally, if the variables are standardized) when determining the clusters. If you add variables that are not relevant, their values will still distort the n-dimensional space, making your clusters worse.
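4) (Double check my logic here, I'm not 100% on this one.) If a variable is introduced with the same number of clusters and the R-squared drops, the new variable is adding variance that the clusters do not explain. That usually means it is only weakly related to the variables already driving the clusters, so it is more likely noise than a useful addition, unless it captures structure you actually care about.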
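To make points 2) to 4) a bit more concrete, here is a minimal sketch in base R on simulated data (not your data). The helper `r_squared` and all variable names are made up for illustration, and R-squared is assumed to mean between-cluster sum of squares divided by total sum of squares, which is what `kmeans` reports as `betweenss` and `totss`.

```r
# Illustrative sketch on simulated data; all names here are hypothetical.
set.seed(42)

# R-squared for k-means: share of the total sum of squares explained by the clusters.
r_squared <- function(x, k) {
  fit <- kmeans(x, centers = k, nstart = 25, iter.max = 50)
  fit$betweenss / fit$totss
}

n <- 1000

# (a) Five standardized variables with no cluster structure at all.
noise <- scale(matrix(rnorm(n * 5), ncol = 5))
r2_by_k <- sapply(2:8, function(k) r_squared(noise, k))
names(r2_by_k) <- paste0("k=", 2:8)
print(round(r2_by_k, 2))  # compare the values across k

# (b) Two well-separated groups in two variables, then one irrelevant variable added.
group <- rep(c(0, 4), each = n / 2)
signal <- cbind(x1 = group + rnorm(n), x2 = group + rnorm(n))
with_noise <- cbind(signal, x3 = rnorm(n))

r2_signal <- r_squared(scale(signal), 2)
r2_with_noise <- r_squared(scale(with_noise), 2)
print(round(c(signal_only = r2_signal, plus_noise = r2_with_noise), 2))
```

In part (a) the R-squared should keep climbing as k grows even though the data are pure noise, which is why a higher R-squared at k=5 is not evidence of five real groups. In part (b) the same two clusters explain a smaller share of the total once an unrelated variable is added, so a drop in R-squared at fixed k points to a variable that does not line up with the existing cluster structure.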