Tags: python-3.x, artificial-intelligence, cluster-analysis, k-means

Optimizing Dunn Index calculation?


The Dunn Index is a method of evaluating clustering; a higher value is better. It is calculated as the lowest intercluster distance (i.e. the smallest distance between any two cluster centroids) divided by the highest intracluster distance (i.e. the largest distance between any two points in any cluster).
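
Written as a formula, with cluster centroids c_i and clusters C_m, that is

    D = ( min_{i ≠ j} d(c_i, c_j) ) / ( max_m max_{p, q ∈ C_m} d(p, q) )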

I have a code snippet for calculating the Dunn Index:

from math import inf

def dunn_index(pf, cf):
    """
    pf -- all data points
    cf -- cluster centroids
    """
    numerator = inf
    for c in cf: # for each cluster
        for t in cf: # for each cluster
            if t is c: continue # if same cluster, ignore
            numerator = min(numerator, distance(t, c)) # find distance between centroids
    denominator = 0
    for c in cf: # for each cluster
        for p in pf: # for each point
            if p.get_cluster() is not c: continue # if point not in cluster, ignore
            for t in pf: # for each point
                if t.get_cluster() is not c: continue # if point not in cluster, ignore
                if t is p: continue # if same point, ignore
                denominator = max(denominator, distance(t, p))
    return numerator/denominator

The problem is that this is exceptionally slow: for an example data set of 5000 instances and 15 clusters, the function above needs to perform just over 375 million distance calculations at worst. Realistically it's much lower, but even a best case, where the data is already ordered by cluster, is around 25 million distance calculations. I want to shave time off of it, and I've already tried rectilinear distance instead of Euclidean, which didn't help much.

How can I improve this algorithm?


Solution

  • TLDR: Importantly, the problem is set up in two dimensions. In higher dimensions, these techniques can be ineffective.

    In 2D, we can compute the diameter (intracluster distance) of each cluster in O(n log n) time, where n is the cluster size, using convex hulls. Vectorization is used to speed up the remaining operations. There are two possible asymptotic improvements mentioned at the end of the post; contributions welcome ;)


    Setup and fake data:

    import numpy as np
    from scipy import spatial
    from matplotlib import pyplot as plt
    
    # set up fake data
    np.random.seed(0)
    n_centroids = 1000
    centroids = np.random.rand(n_centroids, 2)
    cluster_sizes = np.random.randint(1, 1000, size=n_centroids)
    # labels from 1 to n_centroids inclusive
    labels = np.repeat(np.arange(n_centroids), cluster_sizes) + 1
    points = np.zeros((cluster_sizes.sum(), 2))
    points[:,0] = np.repeat(centroids[:,0], cluster_sizes)
    points[:,1] = np.repeat(centroids[:,1], cluster_sizes)
    points += 0.05 * np.random.randn(cluster_sizes.sum(), 2)
    

    Looks somewhat like this:

    (figure: scatter plot of the generated points)

    Next, we define a diameter function for computing the largest intracluster distance, based on this approach using the convex hull.

    # compute the diameter based on convex hull 
    def diameter(pts):
      # need at least 3 points to construct the convex hull, so handle small clusters directly
      if pts.shape[0] <= 1:
        return 0
      if pts.shape[0] == 2:
        return np.sqrt(((pts[0] - pts[1])**2).sum())
      # two points which are furthest apart will occur as vertices of the convex hull
      hull = spatial.ConvexHull(pts)
      candidates = pts[hull.vertices]
      return spatial.distance_matrix(candidates, candidates).max()
    

    For the Dunn index calculation, I assume that we have already computed the points, the cluster labels and the cluster centroids.

    If the number of clusters is large, the following solution based on Pandas may perform well:

    import pandas as pd
    def dunn_index_pandas(pts, labels, centroids):
      # O(k n log(n)) with k clusters and n points; better performance with more even clusters
      # apply the diameter function defined above to each cluster's sub-frame
      max_intracluster_dist = pd.DataFrame(pts).groupby(labels).apply(lambda g: diameter(g.values)).max()
      # O(k^2) with k clusters; can be reduced to O(k log(k))
      # get pairwise distances between centroids
      cluster_dmat = spatial.distance_matrix(centroids, centroids)
      # fill diagonal with +inf: ignore zero distance to self in "min" computation
      np.fill_diagonal(cluster_dmat, np.inf)
      min_intercluster_dist = cluster_dmat.min()
      return min_intercluster_dist / max_intracluster_dist
    

    Otherwise, we can continue with a pure numpy solution.

    def dunn_index(pts, labels, centroids):
      # O(k n log(n)) with k clusters and n points; better performance with more even clusters
      max_intracluster_dist = max(diameter(pts[labels==i]) for i in np.unique(labels))
      # O(k^2) with k clusters; can be reduced to O(k log(k))
      # get pairwise distances between centroids
      cluster_dmat = spatial.distance_matrix(centroids, centroids)
      # fill diagonal with +inf: ignore zero distance to self in "min" computation
      np.fill_diagonal(cluster_dmat, np.inf)
      min_intercluster_dist = cluster_dmat.min()
      return min_intercluster_dist / max_intracluster_dist
    
    %time dunn_index(points, labels, centroids)
    # returned value 2.15
    # in 2.2 seconds
    %time dunn_index_pandas(points, labels, centroids)
    # returned 2.15
    # in 885 ms
    

    For 1000 clusters with i.i.d. ~U[1,1000] cluster sizes this takes 2.2 seconds on my machine. This number drops to roughly 0.9 seconds with the Pandas approach for this example (many small clusters).

    There are two further optimization opportunities that are relevant when the number of clusters is large:

    • First, I am computing the minimal intercluster distance with a brute force O(k^2) approach where k is the number of clusters. This can be reduced to O(k log(k)), as discussed here; a KD-tree sketch of this idea follows after this list.

    • Second, max(diameter(pts[labels==i]) for i in np.unique(labels)) requires k passes over an array of size n. With many clusters this can become the bottleneck (as in this example). This is somewhat mitigated with the pandas approach, but I expect that this can be optimized a lot further; see the second sketch below. For the current parameters, roughly one third of compute time is spent outside of computing intercluster or intracluster distances.
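
    As a rough sketch of the first idea (using a KD-tree over the centroids, which behaves roughly like O(k log(k)) in low dimensions; the function name below is just illustrative), each centroid's closest other centroid can be found without building the full k×k distance matrix:

    from scipy import spatial

    def min_intercluster_dist_kdtree(centroids):
      # build a KD-tree over the centroids; each query is cheap in low dimensions
      tree = spatial.cKDTree(centroids)
      # k=2 because each centroid's nearest neighbour is itself (distance 0);
      # the second column holds the distance to the closest *other* centroid
      dists, _ = tree.query(centroids, k=2)
      return dists[:, 1].min()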
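
    For the second point, here is a sketch of one possible direction (reusing the diameter function from above): sort the points by label once, then hand contiguous slices to diameter, replacing the k boolean-mask passes with a single O(n log n) argsort.

    def max_intracluster_dist_sorted(pts, labels):
      # one stable sort instead of k evaluations of labels == i over all n points
      order = np.argsort(labels, kind='stable')
      sorted_pts = pts[order]
      sorted_labels = labels[order]
      # start index of each cluster's contiguous block, plus the end of the array
      _, starts = np.unique(sorted_labels, return_index=True)
      bounds = np.append(starts, len(sorted_labels))
      return max(diameter(sorted_pts[bounds[i]:bounds[i+1]])
                 for i in range(len(bounds) - 1))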