python, k-means

Error with optimalK of gap-statistic: 'int' object is not iterable


I was playing around with this Kaggle kernel, which runs k-means for text clustering. I wanted to extend it by automatically identifying the optimal value of k, and I am trying to use the gap statistic for this purpose.

As a first step, I had to install the package with: conda install -c milesgranger gap-stat

Then, I tried the following piece of code.

import numpy as np
from gap_statistic import OptimalK
# X is the feature matrix from the kernel (a scipy sparse matrix)
optimalK = OptimalK(parallel_backend='rust')
k, gapdf = optimalK(X, cluster_array=np.arange(1, 11))

This ended in the error: ValueError: Sparse matrices are not supported by this function. Perhaps one of the scipy.sparse.linalg functions would work instead. I understood that I had to change the last line of code to k, gapdf = optimalK(X.toarray(), cluster_array=np.arange(1, 11)), since optimalK expects a dense NumPy array.

This change fixed the first error, but then I ran into another one: TypeError: 'int' object is not iterable

I am guessing this is an exception left unhandled inside optimalK. Either way, is there anything I can do to solve this problem?


Solution

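  • k, gapdf = optimalK(X.toarray(), cluster_array=np.arange(1, 11)) conflicts with the source code of OptimalK, as pointed out in ForceBru's answer. The call returns only the optimal k, a single integer, so it cannot be unpacked into two names.

    The following change removes the error and is the correct equivalent of the erroneous snippet. The gap values are read from the gap_df attribute after the call.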

    # optimal k value
    k = optimalK(X.toarray(), cluster_array=np.arange(1, 11))
    
    # dataframe with gap values (.head() keeps only the first five rows;
    # use optimalK.gap_df for the full table)
    gapdf = optimalK.gap_df.head()
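
    To see why the original line failed, here is a minimal sketch that does not use gap_statistic at all. It assumes, as described above, that the call returns a single integer; fake_optimal_k is a hypothetical stand-in written only for this illustration. The exact wording of the TypeError message depends on the Python version.

```python
def fake_optimal_k(data, cluster_array):
    """Hypothetical stand-in for the OptimalK call: returns one int, not a pair."""
    return int(cluster_array[0])


def try_unpack():
    """Attempt the two-name unpacking from the question and report what happens."""
    try:
        k, gapdf = fake_optimal_k([[0.0]], cluster_array=[3, 4])
    except TypeError as exc:
        # An int cannot be iterated over, so tuple unpacking fails.
        return type(exc).__name__, str(exc)
    return "no error", ""


error_name, error_message = try_unpack()
print(error_name, error_message)

# Assigning the result to a single name works.
k = fake_optimal_k([[0.0]], cluster_array=[3, 4])
print(k)
```

    The same reasoning applies to the real call: assign the return value to one name, then read the gap values from the object's attribute.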