I was playing around with this Kaggle kernel, which runs k-means for text clustering. I wanted to extend it by automating the identification of the optimal k value for the clustering, and I am trying to use the gap statistic for this purpose.
As a first step, I had to install the package with:
conda install -c milesgranger gap-stat
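For reference, X in the snippets below is the sparse document-term matrix produced in the kernel. Here is a minimal sketch of how such a matrix might be built; the TfidfVectorizer call and the sample documents are assumptions of mine, not the kernel's actual code:

# Assumed setup: a small sparse TF-IDF matrix standing in for the kernel's X
from sklearn.feature_extraction.text import TfidfVectorizer

# in practice the corpus has many more documents than this toy list
docs = ["some sample document", "another sample document", "yet more sample text"]
X = TfidfVectorizer().fit_transform(docs)  # scipy.sparse matrix, not a dense array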
Then, I tried the following piece of code.
import numpy as np
from gap_statistic import OptimalK

optimalK = OptimalK(parallel_backend='rust')
k, gapdf = optimalK(X, cluster_array=np.arange(1, 11))
This ended up in the following error:
ValueError: Sparse matrices are not supported by this function. Perhaps one of the scipy.sparse.linalg functions would work instead.
I understood that I had to change the last line of code to
k, gapdf = optimalK(X.toarray(), cluster_array=np.arange(1, 11))
since optimalK expects a dense NumPy array.
This change handled the first error, but then I landed in another one:
TypeError: 'int' object is not iterable
I am guessing this is an exception left unhandled inside OptimalK. Regardless, is there anything I can do to solve this problem?
The line
k, gapdf = optimalK(X.toarray(), cluster_array=np.arange(1, 11))
conflicts with the source code of OptimalK, as shown in ForceBru's answer: the call returns a single integer (the suggested number of clusters), so trying to unpack that integer into two variables is what raises the TypeError.
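You can reproduce the same failure without the library at all; it is plain Python behaviour when a single integer is unpacked into two names (the exact message wording varies across Python versions):

# Minimal reproduction of the TypeError, independent of gap_statistic
a, b = 5  # TypeError: an int cannot be unpacked into two variables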
The following change removes the error and is the correct equivalent of the erroneous snippet:
# optimal k value (an int)
k = optimalK(X.toarray(), cluster_array=np.arange(1, 11))
# dataframe with the gap values, available as an attribute after the call
gapdf = optimalK.gap_df
gapdf.head()  # preview the first few rows
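As a follow-up usage sketch (not part of the original kernel), you could then inspect the gap values and fit KMeans with the suggested k. The gap_df column names below ('n_clusters', 'gap_value') are the ones documented in the gap-stat README, so verify them against your installed version:

from sklearn.cluster import KMeans

# gap statistic for each candidate number of clusters
print(gapdf[['n_clusters', 'gap_value']])
# cluster the documents with the k suggested by the gap statistic
labels = KMeans(n_clusters=k, random_state=0).fit_predict(X.toarray())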