python machine-learning cluster-analysis k-means

Clustering an array of values without using thresholds

I want to segment a 1D dataset, where each value represents an error, into 2 segments:

A cluster with the smallest values
All the others

Example:

import numpy as np
X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5, 21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)

In this small example, I would like to group the first 4 values into one cluster and ignore the others. I do not want a solution based on a threshold, because the centroid of the cluster of interest will not always have the same value: it might be 1e-6, 1e-3, or 1.

My idea was to use a k-means clustering algorithm, which would work fine if I knew how many clusters exist in my data. In the example above there are 3: one around 1 (the cluster of interest), one around 22, and one around 51. Unfortunately, I do not know the number of clusters in advance, and simply searching for 2 clusters does not segment the dataset as intended.

from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=2, random_state=0).fit(X)
kmeans.labels_

This returns a cluster 1 that is far too large, since it also includes the values from the cluster centered around 22:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0])

I did find some interesting answers on methods for selecting k, but they complicate the algorithm, and I feel there must be a better way to solve this problem.

I'm open to any suggestions and examples that work on the X array provided.

Solution

You might find AffinityPropagation useful here, as it does not require you to specify the number of clusters to generate. You might, however, have to tune the damping and preference parameters so that it produces the expected results.

On the provided example, the default parameters seem to do the job:

import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5,
              21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)
ap = AffinityPropagation(random_state=12).fit(X)
y = ap.predict(X)
print(y)
# [0 0 0 0 1 1 1 1 1 1 2 2 2 2 2 2]
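If the defaults do not separate your own data as cleanly, the damping and preference parameters mentioned above are the ones to vary. As a rough guide, a lower (more negative) preference tends to produce fewer clusters, and a damping value closer to 1 can help convergence. The sketch below only shows how to pass these parameters and count the resulting clusters; the preference values are hypothetical and chosen purely for illustration, so expect to adapt them to the scale of your errors.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5,
              21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)

# Hypothetical preference values, chosen only to illustrate the effect;
# suitable values depend on the scale of your data.
cluster_counts = {}
for preference in (-5000, -500, -50):
    ap = AffinityPropagation(damping=0.9, preference=preference,
                             random_state=12).fit(X)
    # cluster_centers_indices_ lists the exemplars, one per cluster found
    # (it is empty if the algorithm did not converge).
    cluster_counts[preference] = len(ap.cluster_centers_indices_)

print(cluster_counts)
```

Comparing the counts across a few preference values is a quick way to see how sensitive the segmentation is before settling on one setting.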

To obtain individual clusters from X, you can index using y:

first_cluster = X[y==0].ravel()
first_cluster
# array([1. , 1.5, 0.4, 1.1])
second_cluster = X[y==1].ravel()
second_cluster
# array([23. , 24. , 22.5, 21. , 20. , 25. ])
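Since you are only interested in the cluster with the smallest values, it may be safer not to assume that this cluster always gets label 0. A minimal sketch, assuming the fit converged so that cluster_centers_ is not empty: pick the label whose exemplar is smallest, then index X with it. The names smallest_label and smallest_cluster are just illustrative.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

X = np.array([1, 1.5, 0.4, 1.1, 23, 24, 22.5,
              21, 20, 25, 40, 50, 50, 51, 52, 53]).reshape(-1, 1)
ap = AffinityPropagation(random_state=12).fit(X)

# cluster_centers_ holds one exemplar per cluster, shape (n_clusters, 1) here.
# Row i corresponds to label i, so argmin gives the label of the
# cluster whose exemplar has the smallest value.
smallest_label = int(np.argmin(ap.cluster_centers_.ravel()))

# Keep only the values assigned to that cluster.
smallest_cluster = X[ap.labels_ == smallest_label].ravel()
print(smallest_cluster)
```

This avoids any threshold: the selection depends only on which exemplar is smallest, whatever its absolute scale.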