I have an array of scalar numbers, pm, and a list of indexes, idx, so pm[idx] is a subset of pm. How can I separate pm[idx] into two clusters (according to the Euclidean distance) and obtain two sets of corresponding indexes (ideally using scikit-learn)?
For example,
pm = array([0,1,2,3,4,100,105])
idx = [0,2,3,5,6]
How can I obtain idx1 = [0,2,3] and idx2 = [5,6]?
Basically, you want to filter your data pm, which is easily done with your idx array. You can then cluster the filtered data to obtain two groups.
Clustering algorithms such as k-means (partition-based) or single-link (hierarchical) work well here. In scikit-learn you could use sklearn.cluster.AgglomerativeClustering.
As those clustering algorithms expect your data to have the features in columns and the instances in rows, you need to reshape your data into a 2-D array with a single column.
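To illustrate the reshape step, here is a minimal sketch using the example data from the question. The names subset and X are only illustrative; reshape(-1, 1) has the same effect as np.expand_dims(..., axis=-1) for a 1-D array.

```python
import numpy as np

pm = np.array([0, 1, 2, 3, 4, 100, 105])
idx = [0, 2, 3, 5, 6]

subset = pm[idx]            # 1-D array, shape (5,)
X = subset.reshape(-1, 1)   # 2-D array, shape (5, 1): 5 instances, 1 feature
print(subset.shape, X.shape)
```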
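From the resulting cluster labels you can then create separate index arrays using a list comprehension. (I didn't find a NumPy function that does the same.) Note that these are positions within pm[idx]; index idx with them to get the original indexes into pm.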
Your solution could look like the following:
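import numpy as np
from sklearn.cluster import AgglomerativeClustering

cluster_algorithm = AgglomerativeClustering(n_clusters=2)
# expand_dims turns pm[idx] into shape (n_samples, 1): one feature per row
labels = cluster_algorithm.fit_predict(np.expand_dims(pm[idx], axis=-1))
print(labels)  # e.g. [1 1 1 0 0]; which cluster gets label 0 is arbitrary
# Positions within pm[idx] that belong to each cluster
idx_labels = [np.where(labels == e)[0] for e in set(labels)]
idx_labels # [array([3, 4], dtype=int64), array([0, 1, 2], dtype=int64)]
# Map those positions back to the original indexes into pm
idx_clusters = [np.asarray(idx)[p] for p in idx_labels]  # here: [array([5, 6]), array([0, 2, 3])]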
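Since k-means was mentioned above as an alternative, here is a rough sketch of the same idea with sklearn.cluster.KMeans that returns the original indexes directly. The helper name split_indexes is hypothetical, and sorting the groups by their mean value is just one assumed way to get a stable order (cluster label numbering is otherwise arbitrary).

```python
import numpy as np
from sklearn.cluster import KMeans


def split_indexes(pm, idx, n_clusters=2):
    """Hypothetical helper: cluster pm[idx] and return the original indexes per cluster."""
    pm = np.asarray(pm)
    idx = np.asarray(idx)
    # One feature per row, as scikit-learn expects
    X = pm[idx].reshape(-1, 1).astype(float)
    labels = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(X)
    # Boolean masks on the labels select the matching entries of idx
    groups = [idx[labels == k] for k in range(n_clusters)]
    # Order clusters by their mean value so the result does not depend on label numbering
    groups.sort(key=lambda g: pm[g].mean())
    return [g.tolist() for g in groups]


pm = np.array([0, 1, 2, 3, 4, 100, 105])
idx = [0, 2, 3, 5, 6]
idx1, idx2 = split_indexes(pm, idx)
print(idx1, idx2)  # [0, 2, 3] [5, 6]
```

Indexing idx with a boolean mask avoids the separate np.where step and the mapping back to pm.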