Tags : scikit-learn, cluster-analysis, k-means

K means with a condition


I want to apply K means (or any other simple clustering algorithm) to data with two variables, but I want the clusters to respect a condition: the sum of a third variable per cluster > some_value. Is that possible?


Solution

  • Notations :
    - K is the number of clusters
    - let's say the first two variables are the point coordinates (x, y)
    - V denotes the third variable
    - Ci : the sum of V over cluster i
    - S : the total sum (S = sum of the Ci)
    - T : the threshold
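
    With these notations, the problem reads : cluster the (x, y) points in the spirit of kmeans, subject to Ci > T for every cluster i. Since the Ci sum to S, this is only feasible when K*T < S.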

    Problem definition :
    From what I understood, the aim is to run an algorithm that keeps the spirit of kmeans while respecting the constraints.

    Task 1 - group points by proximity to centroids [kmeans]
    Task 2 - for each cluster i, ensure Ci > T [constraint]

    Regular kmeans limitation for the constraint problem :
    A regular kmeans assigns points to centroids by taking them in arbitrary order. In our case, this leads to uncontrolled growth of the Ci while adding points.

    For example, take K=2, T=40 and 4 points whose third variable is V1=50, V2=1, V3=50, V4=50. Suppose also that points P1, P3, P4 are closer to centroid 1, and point P2 is closer to centroid 2.

    Let's run the assignment step of a regular kmeans and keep track of the Ci :
    1-- take point P1, assign it to cluster 1. C1=50 > T
    2-- take point P2, assign it to cluster 2. C2=1
    3-- take point P3, assign it to cluster 1. C1=100 > T => C1 grows too much !
    4-- take point P4, assign it to cluster 1. C1=150 > T => !!!
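
    A few throwaway lines reproduce this trace (the V values and centroid proximities are taken from the example above) :

    V = {'P1': 50, 'P2': 1, 'P3': 50, 'P4': 50}     # third variable per point
    closest = {'P1': 1, 'P2': 2, 'P3': 1, 'P4': 1}  # nearest centroid per point
    T = 40
    C = {1: 0, 2: 0}
    for p in ('P1', 'P2', 'P3', 'P4'):              # arbitrary (index) order
        C[closest[p]] += V[p]
        print(p, '-> cluster', closest[p], ' C =', C)
    # ends with C[1]=150 while C[2]=1 < T : cluster 2 never meets the constraint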

    Modified kmeans :
    In the previous example, we want to prevent C1 from growing too much and to help C2 grow.

    This is like pouring champagne into several glasses : if you see a glass with less champagne, you go and fill it. You do that because you have constraints : a limited amount of champagne (S is bounded) and because you want every glass to have enough champagne (Ci > T).

    Of course, this is just an analogy. Our modified kmeans will add new points to the cluster with the minimal Ci until the constraint is achieved (Task 2). Now, in which order should we add points? By proximity to centroids (Task 1). Once the constraint is achieved for every cluster i, we can just run a regular kmeans on the remaining unassigned points.
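
    Putting it together, one sweep of the modified algorithm looks like the skeleton below (it reuses the functions defined in the full code that follows ; the stopping test and the max_sweeps bound are placeholders) :

    for sweep in range(max_sweeps):                      # stopping test omitted here
        inds, dists = dist2centroids(points, centroids)  # Task 1 : proximity order
        assignements, cumul_third_var = assign_points_with_constraints(
            inds, dists, third_var, accept_thresh)       # Task 2 : fill the worst Ci first
        assignements = assign_points_with_kmeans(points, centroids, assignements)
        centroids = move_centroids(points, assignements)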

    Implementation :
    Next, I give a Python implementation of the modified algorithm. Figure 1 displays a representation of the third variable using transparency to visualize large vs. low values. Figure 2 displays the evolution of the clusters using colors.

    You can play with the accept_thresh parameter. In particular, note that :
    For accept_thresh = 0 => regular kmeans (the constraint is reached immediately)
    For accept_thresh = third_var.sum() / (2*K), you might observe that some points that are closer to a given centroid get assigned to another one for constraint reasons.
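
    Whatever value you pick, remember the feasibility condition above (K*T must stay below S, since the Ci sum to S) ; a one-line guard could be :

    assert accept_thresh * K < third_var.sum(), "threshold too high, constraints infeasible"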

    Code :

    import sys

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
    nb_samples = 1000
    K = 3  # for demo purposes, also used to generate the cloud points
    c_std = 1.2
    
    # Generate test samples :
    points, classes = datasets.make_blobs(n_features=2, n_samples=nb_samples,
                                          centers=K, cluster_std=c_std)
    
    third_var_distribution = 'cubic_bycluster'  # or 'uniform', 'linear_bycluster'
    
    if third_var_distribution == 'uniform':
        third_var = np.random.random(nb_samples)
    elif third_var_distribution == 'linear_bycluster':
        third_var = np.random.random(nb_samples) * classes
    elif third_var_distribution == 'cubic_bycluster':
        # cube the blob label so that V varies strongly from one blob to another
        third_var = np.random.random(nb_samples) * classes ** 3
    
    
    # Threshold parameters :
    # Try with K=3 and :
    # T = K => one cluster reaches the constraint, two clusters won't converge
    # T = 2K =>
    accept_thresh = third_var.sum() / (2*K)
    
    
    def dist2centroids(points, centroids):
        '''Order the points by distance to each centroid.
           Returns two (nb_samples, K) arrays :
           - point indices sorted by increasing distance to each centroid
           - the distance of each point to each centroid
           dim 0 : point (rank or index), dim 1 : centroid
        '''
        dist = np.sqrt(((points - centroids[:, np.newaxis]) ** 2).sum(axis=2))
        ord_dist_indices = np.argsort(dist, axis=1)
    
        ord_dist_indices = ord_dist_indices.transpose()
        dist = dist.transpose()
    
        return ord_dist_indices, dist
    
    
    def assign_points_with_constraints(inds, dists, tv, accept_thresh):
        assigned = [False] * nb_samples
        assignements = np.full(nb_samples, -1, dtype=int)
        cumul_third_var = np.zeros(K, dtype=float)
        current_inds = np.zeros(K, dtype=int)
    
        max_round = nb_samples * K  # safety cap in case a constraint cannot be met
    
        for _ in range(max_round):
            # least advanced cluster in terms of cumulated third_var :
            cluster = np.argmin(cumul_third_var)
    
            if cumul_third_var[cluster] > accept_thresh:
                break  # even the worst cluster has enough : all constraints are met
    
            while current_inds[cluster] < nb_samples:
                # add points to increase cumulated third_var on this cluster,
                # taking them by increasing distance to its centroid
                i_inds = current_inds[cluster]
                closest_pt_index = inds[i_inds, cluster]
    
                if assigned[closest_pt_index]:
                    current_inds[cluster] += 1
                    continue  # pt already assigned to a cluster
    
                assignements[closest_pt_index] = cluster
                cumul_third_var[cluster] += tv[closest_pt_index]
                assigned[closest_pt_index] = True
                current_inds[cluster] += 1
    
                new_cluster = np.argmin(cumul_third_var)
                if new_cluster != cluster:
                    break  # another cluster is now the worst one : switch to it
    
        return assignements, cumul_third_var
    
    
    def assign_points_with_kmeans(points, centroids, assignements):
        # assign every point left unassigned (label -1) to its nearest centroid
        new_assignements = np.array(assignements, copy=True)
    
        for count, asg in enumerate(assignements):
            if asg > -1:
                continue  # already assigned by the constrained step
    
            pt = points[count, :]
            distances = np.sqrt(((pt - centroids) ** 2).sum(axis=1))
            new_assignements[count] = np.argmin(distances)
    
        return new_assignements
    
    
    def move_centroids(points, labels):
        # recompute each centroid as the mean of the points assigned to it
        centroids = np.zeros((K, 2), dtype=float)
    
        for k in range(0, K):
            centroids[k] = points[labels == k].mean(axis=0)
    
        return centroids
    
    
    rgba_colors = np.zeros((third_var.size, 4))
    rgba_colors[:, 0] = 1.0
    rgba_colors[:, 3] = 0.1 + (third_var / third_var.max()) / 1.12
    plt.figure(1, figsize=(14, 14))
    plt.title("Three blobs", fontsize='small')
    plt.scatter(points[:, 0], points[:, 1], marker='o', c=rgba_colors)
    
    # Initialize centroids (naive random init ; see the kmeans++ note in 'Improvements')
    centroids = np.random.random((K, 2)) * 10
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', color='red')
    
    # Step 1 : order points by distance to centroid :
    inds, dists = dist2centroids(points, centroids)
    
    # Check if clustering is theoretically possible :
    tv_sum = third_var.sum()
    tv_max = third_var.max()
    if tv_max > tv_sum / K:  # a single point outweighs an even K-way split
        print("No solution to the clustering problem !\n")
        print("For one point : third variable is too high.")
        sys.exit(0)
    
    stop_criter_eps = 0.001
    epsilon = 100000
    prev_cumdist = 100000
    
    plt.figure(2, figsize=(14, 14))
    ln, = plt.plot([])
    plt.ion()
    plt.show()
    
    while epsilon > stop_criter_eps:
    
        # Modified kmeans assignment :
        assignements, cumul_third_var = assign_points_with_constraints(inds, dists, third_var, accept_thresh)
    
        # Kmeans on remaining points :
        assignements = assign_points_with_kmeans(points, centroids, assignements)
    
        centroids = move_centroids(points, assignements)
    
        inds, dists = dist2centroids(points, centroids)
    
        epsilon = np.abs(prev_cumdist - dists.sum())
    
        print("Delta on error :", epsilon)
    
        prev_cumdist = dists.sum()
    
        plt.clf()
        plt.title("Current Assignements", fontsize='small')
        plt.scatter(points[:, 0], points[:, 1], marker='o', c=assignements)
        plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', color='red', linewidths=10)
        plt.text(0,0,"THRESHOLD T = "+str(accept_thresh), va='top', ha='left', color="red", fontsize='x-large')
        for k in range(0, K):
            plt.text(centroids[k, 0], centroids[k, 1] + 0.7, "Ci = "+str(cumul_third_var[k]))
        plt.show()
        plt.pause(1)
    

    Improvements :
    - use the distribution of the third variable for assignments
    - manage divergence of the algorithm
    - better initialization (kmeans++) ; see the sketch below
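
    For the kmeans++ point, scikit-learn (>= 0.24) exposes its seeding step directly, so the naive random centroid initialization above could be replaced by something like :

    from sklearn.cluster import kmeans_plusplus

    # spread the K initial centroids according to the kmeans++ scheme,
    # using the actual data instead of an arbitrary [0, 10] square
    centroids, _ = kmeans_plusplus(points, n_clusters=K)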