[python] [numpy] [cluster-analysis]

Clustering algorithm that keeps separated a given set of pairs


I have a clustering problem in which I have to split a set S of samples into C clusters, where C is known. Normally, I am able to perform the clustering with a simple KMeans, which works just fine.

To complicate things, I have a known set D of pairs of samples that cannot under any circumstances be assigned to the same cluster. Currently I am not using this information and the clustering still works fine, but I would like to introduce it to improve robustness, since it comes for free from the problem I am trying to solve.

Example: S consists of 20 samples with 5 features each, C is 3, and D forces the following pairs {(1, 3), (3, 5), (10, 19)} to be in different clusters.

I am looking for a solution in python3, preferably with numpy/sklearn/scipy. Do you know if there is some out-of-the-box clustering algorithm that takes into account this kind of constraint? I have looked into sklearn but found no such thing.


Solution

  • This sounds exactly like semi-supervised clustering with pairwise constraints, in which unsupervised k-means clustering is augmented by (imperfect) supervision through pairwise constraints on a subset of the data. Your particular example uses cannot-link constraints; must-link constraints could be added as well.
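
    For intuition, the hard-constraint relative of this idea (COP-k-means) changes only the assignment step of k-means: a point joins the nearest centroid whose cluster does not already contain one of its cannot-link partners. A minimal, illustrative sketch of that rule (constrained_assign is a hypothetical helper of mine, not part of any library):

    import numpy as np

    def constrained_assign(X, centroids, cl_pairs):
        """One COP-k-means-style assignment step (sketch, not library code)."""
        # collect the cannot-link partners of every point
        partners = {i: set() for i in range(len(X))}
        for a, b in cl_pairs:
            partners[a].add(b)
            partners[b].add(a)
        labels = np.full(len(X), -1)
        for i in range(len(X)):
            # try clusters in order of increasing distance to point i
            for c in np.argsort(np.linalg.norm(centroids - X[i], axis=1)):
                # feasible only if no cannot-link partner already sits in cluster c
                if all(labels[p] != c for p in partners[i]):
                    labels[i] = c
                    break
            # if labels[i] is still -1, no feasible cluster exists and
            # COP-k-means would abort
        return labels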

    Unfortunately, most implementations I encountered in Python are rather brittle. For example, the Python library active-semi-supervised-clustering lets you add ml (must-link) and cl (cannot-link) relations just as you describe. The code is:

    import numpy as np
    from matplotlib import pyplot as plt
    from active_semi_clustering.semi_supervised.pairwise_constraints import PCKMeans
    
    # data
    S = [[-0.2, -1.0], [3.3, 3.9], [-2.0, 0.6], [2.3, -0.8], [1.1, 1.9], [2.8, -0.3], [4.2, 2.6], [1.8, 6.8], [1.4, -0.7], [2.6, 1.8], [2.6, 5.4], [0.8, -0.6], [3.0, 1.4], [-0.6, -0.4], [0.3, -0.2], [0.8, -0.4], [4.8, 5.1], [2.4, 5.2], [2.3, 5.3], [0.9, 0.3], [2.8, 4.1], [1.4, -0.7], [2.7, 5.6], [0.8, 0.8], [1.9, 5.3], [2.3, 5.3], [2.1, 0.5], [3.1, 5.3], [2.3, 0.8], [-0.2, -0.0], [2.4, 0.0], [3.6, -0.5], [1.3, -0.4], [3.0, 4.6], [0.4, -0.1], [-2.3, -1.4], [-1.9, -1.9], [4.2, 5.4], [-1.3, -0.9], [2.7, 0.2], [1.9, 6.5], [2.8, -0.8], [0.0, -0.3], [3.2, 5.9], [1.7, 4.6], [2.3, -0.3], [2.9, 1.2], [3.5, 2.0], [1.2, 2.3], [2.0, 1.5], [4.2, 5.8], [0.7, -2.0], [-0.8, -0.9], [4.7, 0.7], [-1.2, -1.8], [3.5, 5.1], [2.6, 0.7], [1.1, 3.0], [1.9, 6.5], [2.5, 6.5], [2.2, -0.2], [-0.9, -0.3], [3.1, 4.1], [-0.7, -0.3], [4.1, 5.2], [2.6, 0.8], [4.0, 3.5], [4.2, 4.3], [3.1, 1.1], [0.9, -0.1], [-0.3, 1.2], [0.2, -0.8], [0.1, -1.1], [0.4, -1.1], [-0.1, -0.7]]
    S = np.array(S)
    
    # no. of clusters
    C = 3
    
    # cannot-link constraints (pairs of indices into S)
    D = [(1, 3), (3, 5), (10, 19)]
    
    # color plots
    colDict = {0: '#fc6A03', 1: 'green', 2: '#006699'}
    
    plt.title('Input Data ($S$)', fontsize=20)
    plt.scatter(x=S[:, 0], y=S[:, 1], c='darkgrey')
    plt.show()
    

    [Plot: Input Data (S)]

    # Naïve Clustering
    clust = PCKMeans(n_clusters=C, max_iter=1000)
    clust.fit(S, cl=[], ml=[])
    plt.title('Naïve (unconstrained) k-Means', fontsize=18)
    plt.scatter(x=S[:, 0], y=S[:, 1], c=[colDict[c] for c in clust.labels_])
    plt.show()
    

    [Plot: Naïve (unconstrained) k-Means]

    # Constr. Clustering
    const_clust = PCKMeans(n_clusters=C, max_iter=10000)
    const_clust.fit(S, ml=[], cl=D)
    plt.title('Constrained k-Means', fontsize=18)
    plt.scatter(x=S[:, 0], y=S[:, 1], c=[colDict[c] for c in const_clust.labels_])
    plt.show()
    

    which yields

    [Plot: Constrained k-Means]

    Although the plot looks different, checking whether the cannot-link constraints are actually met results in

    [const_clust.labels_[d[0]] != const_clust.labels_[d[1]] for d in D]
    >[True, False, True]
    

    indicating that the points with indices 3 and 5 were assigned the same cluster label. Not good. This can happen because PCKMeans treats the pairwise constraints as soft penalties in its objective rather than as hard feasibility conditions. The sample size and the distribution of the data points across the feature space also seem to affect this considerably, so you may well see no violations when you apply it to your actual data.

    Unfortunately, the repository does not allow setting a seed (to make the iterative estimation procedure reproducible) and ignores one set via np.random.seed(567). Be mindful of reproducibility and rerun the code several times, e.g. with a retry loop as sketched below.
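
    A pragmatic workaround is to refit several times and keep the first solution that satisfies all cannot-link pairs. A minimal sketch, reusing S, C, D and PCKMeans from above (fit_until_valid and the attempt budget of 20 are my own ad-hoc choices):

    # refit until every cannot-link pair is separated, or the budget runs out
    def fit_until_valid(S, C, cl_pairs, attempts=20):
        for _ in range(attempts):
            model = PCKMeans(n_clusters=C, max_iter=10000)
            model.fit(S, ml=[], cl=cl_pairs)
            if all(model.labels_[a] != model.labels_[b] for a, b in cl_pairs):
                return model  # all cannot-link constraints met
        return None  # no constraint-satisfying run within the budget

    valid_clust = fit_until_valid(S, C, D)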

    Other repositories such as scikit-learn suggest that some clustering routines may accept constraints, but do not document how to impose them.

    Note that there are other variants of constrained k-means clustering, e.g. where the pairwise constraints are not certain (see this reference) or where the number of data points per cluster is constrained (see this python library).