Search code examples
pythonstatisticscluster-analysismetricsdata-science

Is my python implementation of the Davies-Bouldin Index correct?


I'm trying to calculate the Davies-Bouldin Index in Python.

Here are the steps the code below tries to reproduce.

5 Steps:

  1. For each cluster, compute euclidean distances between each point to the centroid
  2. For each cluster, compute the average of these distances
  3. For each pair of clusters, compute the euclidean distance between their centroids

Then,

  1. For each pair of clusters, make the sum of the average distances to their respective centroid (computed at step 2) and divide it by the distance separating them (computed at step 3).

Finally,

  1. Compute the mean of all these divisions (= all indexes) to get the Davies-Bouldin index of the whole clustering

Code

def daviesbouldin(X, labels, centroids):

    import numpy as np
    from scipy.spatial.distance import pdist, euclidean
    
    nbre_of_clusters = len(centroids) #Get the number of clusters
    distances = [[] for e in range(nbre_of_clusters)] #Store intra-cluster distances by cluster
    distances_means = [] #Store the mean of these distances
    DB_indexes = [] #Store Davies_Boulin index of each pair of cluster
    second_cluster_idx = [] #Store index of the second cluster of each pair
    first_cluster_idx = 0 #Set index of first cluster of each pair to 0
    
    # Step 1: Compute euclidean distances between each point of a cluster to their centroid
    for cluster in range(nbre_of_clusters):
        for point in range(X[labels == cluster].shape[0]):
            distances[cluster].append(euclidean(X[labels == cluster][point], centroids[cluster]))
    
    # Step 2: Compute the mean of these distances
    for e in distances:
        distances_means.append(np.mean(e))
  
    # Step 3: Compute euclidean distances between each pair of centroid
    ctrds_distance = pdist(centroids) 
     
    # Tricky step 4: Compute Davies-Bouldin index of each pair of cluster   
    for i, e in enumerate(e for start in range(1, nbre_of_clusters) for e in range(start, nbre_of_clusters)):
        second_cluster_idx.append(e)
        if second_cluster_idx[i-1] == nbre_of_clusters - 1:
            first_cluster_idx += 1
        DB_indexes.append((distances_means[first_cluster_idx] + distances_means[e]) / ctrds_distance[i])
     
    # Step 5: Compute the mean of all DB_indexes   
    print("DAVIES-BOULDIN Index: %.5f" % np.mean(DB_indexes)) 

In arguments:

  • X is the data
  • labels, are the labels computed by a clustering algorithm (i.e: kmeans)
  • centroids are the coordinates of each cluster's centroid (i.e: cluster_centers_)

Also, note that I'm using Python 3

QUESTION1: Is the computation of euclidean distances between each pair of centroid correct (step 3)?

QUESTION2: Is my implementation of step 4 correct?

QUESTION3: Do I need to normalise intra and inter cluster distances ?


Further explanations on Step 4

Let's say we have 10 clusters. The loop should compute the DB index of each pair of cluster.

At the first iteration:

  • sums intra-distances mean of cluster 1 (index 0 of distances_means) and intra-distances mean of cluster 2 (index 1 of distances_means)
  • divides this sum by the distance between the 2 clusters (index 0 of ctrds_distance)

At the second iteration:

  • sums intra-distances mean of cluster 1 (index 0 of distances_means) and intra-distances mean of cluster 3 (index 2 of distances_means)
  • divides this sum by the distance between the 2 clusters (index 1 of ctrds_distance)

and so on...

With the example of 10 clusters, the full iteration process should look like this:

intra-cluster distance intra-cluster distance       distance between their
      of cluster:             of cluster:           centroids(storage num):
         0           +             1            /             0
         0           +             2            /             1
         0           +             3            /             2
         0           +             4            /             3
         0           +             5            /             4
         0           +             6            /             5
         0           +             7            /             6
         0           +             8            /             7
         0           +             9            /             8
         1           +             2            /             9
         1           +             3            /             10
         1           +             4            /             11
         1           +             5            /             12
         1           +             6            /             13
         1           +             7            /             14
         1           +             8            /             15
         1           +             9            /             16
         2           +             3            /             17
         2           +             4            /             18
         2           +             5            /             19
         2           +             6            /             20
         2           +             7            /             21
         2           +             8            /             22
         2           +             9            /             23
         3           +             4            /             24
         3           +             5            /             25
         3           +             6            /             26
         3           +             7            /             27
         3           +             8            /             28
         3           +             9            /             29
         4           +             5            /             30
         4           +             6            /             31
         4           +             7            /             32
         4           +             8            /             33
         4           +             9            /             34
         5           +             6            /             35
         5           +             7            /             36
         5           +             8            /             37
         5           +             9            /             38
         6           +             7            /             39
         6           +             8            /             40
         6           +             9            /             41
         7           +             8            /             42
         7           +             9            /             43
         8           +             9            /             44

The problem here is I'm not quite sure that the index of distances_means matches the index of ctrds_distance.

In other words, I'm not sure that the first inter-cluster distance computed corresponds to the distance between cluster 1 and cluster 2. And that the second inter-cluster distance computed corresponds to the distance between cluster 3 and cluster 1... and so on, following the pattern above.

In short: I'm afraid I'm dividing pairs of intra-cluster distances by an inter-cluster distance that is not corresponding.


Solution

  • Here is a shorter, faster corrected version of the Davies-Bouldin index naive implementation above.

    def DaviesBouldin(X, labels):
        n_cluster = len(np.bincount(labels))
        cluster_k = [X[labels == k] for k in range(n_cluster)]
        centroids = [np.mean(k, axis = 0) for k in cluster_k]
        variances = [np.mean([euclidean(p, centroids[i]) for p in k]) for i, k in enumerate(cluster_k)]
        db = []
    
        for i in range(n_cluster):
            for j in range(n_cluster):
                if j != i:
                    db.append((variances[i] + variances[j]) / euclidean(centroids[i], centroids[j]))
    
        return(np.max(db) / n_cluster)
    

    Answering my own questions:

    • the counter on the first draft (step 4) was correct BUT irrelevant
    • there's no need to normalise intra and inter cluster distances
    • there was a mistake when calculating Euclidean distances

    Note you can find innovative approaches that try to improve this index, notably the "New Version of Davies-Bouldin Index" that replaces Euclidean distance by Cylindrical distance.