Tags : scikit-learn, cluster-analysis, k-means

K means with a condition


I want to apply K means (or any other simple clustering algorithm) to data with two variables, but I want the clusters to respect a condition: the sum of a third variable per cluster > some_value. Is that possible?


Solution

  • Notations :
    - K is the number of clusters
    - let's say the first two variables are the point coordinates (x, y)
    - V denotes the third variable
    - Ci : the sum of V over cluster i
    - S : the total sum (S = sum of the Ci)
    - T : the threshold
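
    With these notations, the problem reads : cluster the (x, y) points in the spirit of kmeans, subject to Ci > T for every cluster i. Since the Ci sum to S, this is only feasible when K*T < S.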

    Problem definition :
    From what I understood, the aim is to run an algorithm that keeps the spirit of kmeans while respecting the constraints.

    Task 1 - group points by proximity to centroids [kmeans]
    Task 2 - for each cluster i, ensure Ci > T [constraint]

    Regular kmeans limitation for the constraint problem :
    A regular kmeans assigns points to centroids by taking them in arbitrary order. In our case, this leads to uncontrolled growth of the Ci while adding points.

    For example, take K=2, T=40 and 4 points whose third variable is V1=50, V2=1, V3=50, V4=50. Suppose also that points P1, P3, P4 are closer to centroid 1, and point P2 is closer to centroid 2.

    Let's run the assignment step of a regular kmeans and keep track of the Ci :
    1-- take point P1, assign it to cluster 1. C1=50 > T
    2-- take point P2, assign it to cluster 2. C2=1
    3-- take point P3, assign it to cluster 1. C1=100 > T => C1 grows too much !
    4-- take point P4, assign it to cluster 1. C1=150 > T => !!!
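
    A few throwaway lines reproduce this trace (the V values and centroid proximities are taken from the example above) :

    V = {'P1': 50, 'P2': 1, 'P3': 50, 'P4': 50}     # third variable per point
    closest = {'P1': 1, 'P2': 2, 'P3': 1, 'P4': 1}  # nearest centroid per point
    T = 40
    C = {1: 0, 2: 0}
    for p in ('P1', 'P2', 'P3', 'P4'):              # arbitrary (index) order
        C[closest[p]] += V[p]
        print(p, '-> cluster', closest[p], ' C =', C)
    # ends with C[1]=150 while C[2]=1 < T : cluster 2 never meets the constraint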

    Modified kmeans :
    In the previous example, we want to prevent C1 from growing too much and to help C2 grow.

    This is like pouring champagne into several glasses : if you see a glass with less champagne, you go and fill it. You do that because you have constraints : a limited amount of champagne (S is bounded) and because you want every glass to have enough champagne (Ci > T).

    Of course, this is just an analogy. Our modified kmeans will add new points to the cluster with the minimal Ci until the constraint is achieved (Task 2). Now, in which order should we add points? By proximity to centroids (Task 1). Once the constraint is achieved for every cluster i, we can just run a regular kmeans on the remaining unassigned points.
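
    Putting it together, one sweep of the modified algorithm looks like the skeleton below (it reuses the functions defined in the full code that follows ; the stopping test and the max_sweeps bound are placeholders) :

    for sweep in range(max_sweeps):                      # stopping test omitted here
        inds, dists = dist2centroids(points, centroids)  # Task 1 : proximity order
        assignements, cumul_third_var = assign_points_with_constraints(
            inds, dists, third_var, accept_thresh)       # Task 2 : fill the worst Ci first
        assignements = assign_points_with_kmeans(points, centroids, assignements)
        centroids = move_centroids(points, assignements)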

    Implementation :
    Next, I give a Python implementation of the modified algorithm. Figure 1 displays a representation of the third variable using transparency to visualize large vs. low values. Figure 2 displays the evolution of the clusters using colors.

    You can play with the accept_thresh parameter. In particular, note that :
    For accept_thresh = 0 => regular kmeans (the constraint is reached immediately)
    For accept_thresh = third_var.sum() / (2*K), you might observe that some points that are closer to a given centroid get assigned to another one for constraint reasons.
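
    Whatever value you pick, remember the feasibility condition above (K*T must stay below S, since the Ci sum to S) ; a one-line guard could be :

    assert accept_thresh * K < third_var.sum(), "threshold too high, constraints infeasible"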

    Code :

    import sys

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    
    nb_samples = 1000
    K = 3  # for demo purposes, also used to generate the cloud points
    c_std = 1.2
    
    # Generate test samples :
    points, classes = datasets.make_blobs(n_features=2, n_samples=nb_samples,
                                          centers=K, cluster_std=c_std)
    
    third_var_distribution = 'cubic_bycluster'  # or 'uniform', 'linear_bycluster'
    
    if third_var_distribution == 'uniform':
        third_var = np.random.random(nb_samples)
    elif third_var_distribution == 'linear_bycluster':
        third_var = np.random.random(nb_samples) * classes
    elif third_var_distribution == 'cubic_bycluster':
        # cube the blob label so that V varies strongly from one blob to another
        third_var = np.random.random(nb_samples) * classes ** 3
    
    
    # Threshold parameters :
    # Try with K=3 and :
    # T = K => one cluster reaches the constraint, two clusters won't converge
    # T = 2K =>
    accept_thresh = third_var.sum() / (2*K)
    
    
    def dist2centroids(points, centroids):
        '''Order the points by distance to each centroid.
           Returns two (nb_samples, K) arrays :
           - point indices sorted by increasing distance to each centroid
           - the distance of each point to each centroid
           dim 0 : point (rank or index), dim 1 : centroid
        '''
        dist = np.sqrt(((points - centroids[:, np.newaxis]) ** 2).sum(axis=2))
        ord_dist_indices = np.argsort(dist, axis=1)
    
        ord_dist_indices = ord_dist_indices.transpose()
        dist = dist.transpose()
    
        return ord_dist_indices, dist
    
    
    def assign_points_with_constraints(inds, dists, tv, accept_thresh):
        assigned = [False] * nb_samples
        assignements = np.full(nb_samples, -1, dtype=int)
        cumul_third_var = np.zeros(K, dtype=float)
        current_inds = np.zeros(K, dtype=int)
    
        max_round = nb_samples * K  # safety cap in case a constraint cannot be met
    
        for _ in range(max_round):
            # least advanced cluster in terms of cumulated third_var :
            cluster = np.argmin(cumul_third_var)
    
            if cumul_third_var[cluster] > accept_thresh:
                break  # even the worst cluster has enough : all constraints are met
    
            while current_inds[cluster] < nb_samples:
                # add points to increase cumulated third_var on this cluster,
                # taking them by increasing distance to its centroid
                i_inds = current_inds[cluster]
                closest_pt_index = inds[i_inds, cluster]
    
                if assigned[closest_pt_index]:
                    current_inds[cluster] += 1
                    continue  # pt already assigned to a cluster
    
                assignements[closest_pt_index] = cluster
                cumul_third_var[cluster] += tv[closest_pt_index]
                assigned[closest_pt_index] = True
                current_inds[cluster] += 1
    
                new_cluster = np.argmin(cumul_third_var)
                if new_cluster != cluster:
                    break  # another cluster is now the worst one : switch to it
    
        return assignements, cumul_third_var
    
    
    def assign_points_with_kmeans(points, centroids, assignements):
        # assign every point left unassigned (label -1) to its nearest centroid
        new_assignements = np.array(assignements, copy=True)
    
        for count, asg in enumerate(assignements):
            if asg > -1:
                continue  # already assigned by the constrained step
    
            pt = points[count, :]
            distances = np.sqrt(((pt - centroids) ** 2).sum(axis=1))
            new_assignements[count] = np.argmin(distances)
    
        return new_assignements
    
    
    def move_centroids(points, labels):
        # recompute each centroid as the mean of the points assigned to it
        centroids = np.zeros((K, 2), dtype=float)
    
        for k in range(0, K):
            centroids[k] = points[labels == k].mean(axis=0)
    
        return centroids
    
    
    rgba_colors = np.zeros((third_var.size, 4))
    rgba_colors[:, 0] = 1.0
    rgba_colors[:, 3] = 0.1 + (third_var / third_var.max()) / 1.12
    plt.figure(1, figsize=(14, 14))
    plt.title("Three blobs", fontsize='small')
    plt.scatter(points[:, 0], points[:, 1], marker='o', c=rgba_colors)
    
    # Initialize centroids (naive random init ; see the kmeans++ note in 'Improvements')
    centroids = np.random.random((K, 2)) * 10
    plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', color='red')
    
    # Step 1 : order points by distance to centroid :
    inds, dists = dist2centroids(points, centroids)
    
    # Check if clustering is theoretically possible :
    tv_sum = third_var.sum()
    tv_max = third_var.max()
    if tv_max > tv_sum / K:  # a single point outweighs an even K-way split
        print("No solution to the clustering problem !\n")
        print("For one point : third variable is too high.")
        sys.exit(0)
    
    stop_criter_eps = 0.001
    epsilon = 100000
    prev_cumdist = 100000
    
    plt.figure(2, figsize=(14, 14))
    ln, = plt.plot([])
    plt.ion()
    plt.show()
    
    while epsilon > stop_criter_eps:
    
        # Modified kmeans assignment :
        assignements, cumul_third_var = assign_points_with_constraints(inds, dists, third_var, accept_thresh)
    
        # Kmeans on remaining points :
        assignements = assign_points_with_kmeans(points, centroids, assignements)
    
        centroids = move_centroids(points, assignements)
    
        inds, dists = dist2centroids(points, centroids)
    
        epsilon = np.abs(prev_cumdist - dists.sum())
    
        print("Delta on error :", epsilon)
    
        prev_cumdist = dists.sum()
    
        plt.clf()
        plt.title("Current Assignements", fontsize='small')
        plt.scatter(points[:, 0], points[:, 1], marker='o', c=assignements)
        plt.scatter(centroids[:, 0], centroids[:, 1], marker='o', color='red', linewidths=10)
        plt.text(0,0,"THRESHOLD T = "+str(accept_thresh), va='top', ha='left', color="red", fontsize='x-large')
        for k in range(0, K):
            plt.text(centroids[k, 0], centroids[k, 1] + 0.7, "Ci = "+str(cumul_third_var[k]))
        plt.show()
        plt.pause(1)
    

    Improvements :
    - use the distribution of the third variable for assignments
    - manage divergence of the algorithm
    - better initialization (kmeans++) ; see the sketch below
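
    For the kmeans++ point, scikit-learn (>= 0.24) exposes its seeding step directly, so the naive random centroid initialization above could be replaced by something like :

    from sklearn.cluster import kmeans_plusplus

    # spread the K initial centroids according to the kmeans++ scheme,
    # using the actual data instead of an arbitrary [0, 10] square
    centroids, _ = kmeans_plusplus(points, n_clusters=K)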