Search code examples
pythoncluster-analysisk-meanseuclidean-distancecosine-similarity

How to find most optimal number of clusters with K-Means clustering in Python


I am new to clustering algorithms. I have a movie dataset with more than 200 movies and more than 100 users. All the users rated at least one movie. A value of 1 for good, 0 for bad and blank if the annotator has no choice.

I want to cluster similar users based on their reviews with the idea that users who rated similar movies as good might also rate a movie as good which was not rated by any user in the same cluster. I used cosine similarity measure with k-means clustering. The csv file is shown below:

  UserID         M1     M2       M3  ...............  M200                          
  user1          1      0                               0     
  user2          0      1        1                                      
  user3          1      1                               1                                                                         
    .
    .
    .
    .
 user100         1      0        1                                       

The problem i am facing is that i don't know exactly how to find most optimal number of clusters for this dataset and then draw a graph of those clusters. I am clustering them with k-means and there is no issue with that but i want to know the most stable or optimal number of clusters for this dataset.

I will appreciate some help..


Solution

  • Clustering is part of the unsupervised machine learning methods. Contrary to supervised methods, in unsupervised methods there is not a straightforward approach to determine the "best" model among a set of models that were trained on a certain dataset.

    Nonetheless, there are some quantitative measures. Most of them are based on the concept of "how much are the points in a certain cluster more similar between themself than with the points in different clusters?" I suggest you take a look at the scikit-learn documentation on clustering evaluation. Take a look at all the techniques that do not require labels_true (i.e. at all the unsupervised techniques). Once you have a quantitative measure about the "goodness" of a certain clustering, you usually observe how this quantity evolves while changing the number of clusters; this approach is called Elbow Method.

    Here is some code that uses K-Means algorithm with all possible K values from 2 to 30, calculates various scores for each K value, and stores all scores in a DataFrame.

    seed_random = 1
    
    fitted_kmeans = {}
    labels_kmeans = {}
    df_scores = []
    k_values_to_try = np.arange(2, 31)
    for n_clusters in k_values_to_try:
        
        #Perform clustering.
        kmeans = KMeans(n_clusters=n_clusters,
                        random_state=seed_random,
                        )
        labels_clusters = kmeans.fit_predict(X)
        
        #Insert fitted model and calculated cluster labels in dictionaries,
        #for further reference.
        fitted_kmeans[n_clusters] = kmeans
        labels_kmeans[n_clusters] = labels_clusters
        
        #Calculate various scores, and save them for further reference.
        silhouette = silhouette_score(X, labels_clusters)
        ch = calinski_harabasz_score(X, labels_clusters)
        db = davies_bouldin_score(X, labels_clusters)
        tmp_scores = {"n_clusters": n_clusters,
                      "silhouette_score": silhouette,
                      "calinski_harabasz_score": ch,
                      "davies_bouldin_score": db,
                      }
        df_scores.append(tmp_scores)
    
    #Create a DataFrame of clustering scores, using `n_clusters` as index, for easier plotting.
    df_scores = pd.DataFrame(df_scores)
    df_scores.set_index("n_clusters", inplace=True)
    

    This code assumes that all your numerical features are in a DataFrame X. All clustering performance metrics are stored in df_scores DataFrame. You can easily use the elbow method by plotting columns from df_scores; for instance, if you want to see the elbow graph of the Silhouette Score, you can use df_scores["silhouette_score"].plot().