python cluster-analysis k-means euclidean-distance cosine-similarity

How to find most optimal number of clusters with K-Means clustering in Python

I am new to clustering algorithms. I have a movie dataset with more than 200 movies and more than 100 users. All the users rated at least one movie. A value of 1 for good, 0 for bad and blank if the annotator has no choice.

I want to cluster similar users based on their reviews with the idea that users who rated similar movies as good might also rate a movie as good which was not rated by any user in the same cluster. I used cosine similarity measure with k-means clustering. The csv file is shown below:

  UserID         M1     M2       M3  ...............  M200                          
  user1          1      0                               0     
  user2          0      1        1                                      
  user3          1      1                               1                                                                         
    .
    .
    .
    .
 user100         1      0        1

The problem i am facing is that i don't know exactly how to find most optimal number of clusters for this dataset and then draw a graph of those clusters. I am clustering them with k-means and there is no issue with that but i want to know the most stable or optimal number of clusters for this dataset.

I will appreciate some help..

Solution

Clustering is part of the unsupervised machine learning methods. Contrary to supervised methods, in unsupervised methods there is not a straightforward approach to determine the "best" model among a set of models that were trained on a certain dataset.

Nonetheless, there are some quantitative measures. Most of them are based on the concept of "how much are the points in a certain cluster more similar between themself than with the points in different clusters?" I suggest you take a look at the scikit-learn documentation on clustering evaluation. Take a look at all the techniques that do not require labels_true (i.e. at all the unsupervised techniques). Once you have a quantitative measure about the "goodness" of a certain clustering, you usually observe how this quantity evolves while changing the number of clusters; this approach is called Elbow Method.

Here is some code that uses K-Means algorithm with all possible K values from 2 to 30, calculates various scores for each K value, and stores all scores in a DataFrame.

seed_random = 1

fitted_kmeans = {}
labels_kmeans = {}
df_scores = []
k_values_to_try = np.arange(2, 31)
for n_clusters in k_values_to_try:
    
    #Perform clustering.
    kmeans = KMeans(n_clusters=n_clusters,
                    random_state=seed_random,
                    )
    labels_clusters = kmeans.fit_predict(X)
    
    #Insert fitted model and calculated cluster labels in dictionaries,
    #for further reference.
    fitted_kmeans[n_clusters] = kmeans
    labels_kmeans[n_clusters] = labels_clusters
    
    #Calculate various scores, and save them for further reference.
    silhouette = silhouette_score(X, labels_clusters)
    ch = calinski_harabasz_score(X, labels_clusters)
    db = davies_bouldin_score(X, labels_clusters)
    tmp_scores = {"n_clusters": n_clusters,
                  "silhouette_score": silhouette,
                  "calinski_harabasz_score": ch,
                  "davies_bouldin_score": db,
                  }
    df_scores.append(tmp_scores)

#Create a DataFrame of clustering scores, using `n_clusters` as index, for easier plotting.
df_scores = pd.DataFrame(df_scores)
df_scores.set_index("n_clusters", inplace=True)

This code assumes that all your numerical features are in a DataFrame X. All clustering performance metrics are stored in df_scores DataFrame. You can easily use the elbow method by plotting columns from df_scores; for instance, if you want to see the elbow graph of the Silhouette Score, you can use df_scores["silhouette_score"].plot().