scikit-learn, subset, cluster-analysis

How can I separate an array of numbers into two clusters and return two subsets of corresponding indexes?


I have an array of scalar numbers, pm, and a list of indexes, idx, so pm[idx] is a subset of pm. How can I separate pm[idx] into two clusters (according to the Euclidean distance) and obtain two sets of corresponding indexes (ideally using scikit-learn)?

For example,

pm = array([0,1,2,3,4,100,105])
idx = [0,2,3,5,6]

How can I obtain idx1 = [0,2,3] and idx2 = [5,6]?


Solution

  • Basically, you want to filter your data pm, which is easily done with your idx array, and then cluster the filtered data into two groups.

    Clustering algorithms such as k-means (partition-based) or single linkage (hierarchical) are well suited to this. In scikit-learn you could use sklearn.cluster.AgglomerativeClustering.

    As these clustering algorithms expect the data to have instances as rows and features as columns, you need to reshape your data into a single column.

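    From the resulting cluster labels you can then create separate index arrays using a list comprehension (I didn't find a single NumPy function that does the same). Note that these arrays hold positions within pm[idx], so index idx with them to get indexes into pm.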

    Your solution could look like the following:

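    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    pm = np.array([0, 1, 2, 3, 4, 100, 105])
    idx = [0, 2, 3, 5, 6]

    # scikit-learn expects shape (n_samples, n_features): add a trailing feature axis
    cluster_algorithm = AgglomerativeClustering(n_clusters=2)
    labels = cluster_algorithm.fit_predict(np.expand_dims(pm[idx], axis=-1))

    print(labels)
    >>> [1 1 1 0 0]

    # positions within pm[idx] (not indexes into pm) for each cluster label
    idx_labels = [np.where(labels == e)[0] for e in set(labels)]
    idx_labels  # [array([3, 4], dtype=int64), array([0, 1, 2], dtype=int64)]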
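
The question asks for indexes into pm, while the arrays above are positions within pm[idx]. A minimal sketch of the mapping back could look like the following. It takes the labels printed above as given instead of re-running the clustering, and it assumes idx is converted to a NumPy array so that it can be masked. The names clusters, idx1 and idx2 are only illustrative.

```python
import numpy as np

pm = np.array([0, 1, 2, 3, 4, 100, 105])
idx = np.array([0, 2, 3, 5, 6])  # converted to an array so it can be masked

# Labels as printed above for pm[idx]; the 0/1 numbering is arbitrary.
labels = np.array([1, 1, 1, 0, 0])

# A boolean mask per label selects the entries of idx in that cluster.
clusters = [idx[labels == k] for k in np.unique(labels)]

# Order clusters by their smallest pm value, so the result does not
# depend on which cluster happened to receive label 0.
idx1, idx2 = sorted(clusters, key=lambda c: pm[c].min())

print(idx1.tolist())  # [0, 2, 3]
print(idx2.tolist())  # [5, 6]
```

Masking idx directly with labels == k avoids the intermediate np.where step, and sorting by the smallest value is just one way to make the order of the two groups predictable.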