python, pandas, dataframe, k-means

KMeans Clustering rows in a DataFrame with many columns (integers)


I have a DataFrame made up of 0s and 1s in every row. The idea is to compare all the rows in each df and group them into a fixed number of clusters (in this case, let's say 5).

What I need is the row indexes for each of the 5 clusters (or a .groupby by cluster that keeps the original row index).

The df looks like this:

    0   1   2   3   4   5   6   7   8   9   ... 528 529 530 531 532 533 534 535 536 537
0   0   0   0   0   0   0   0   1   1   1   ... 0   1   1   1   0   0   0   1   0   1
1   0   0   0   0   0   0   0   1   1   1   ... 0   1   1   1   0   0   0   1   0   1
2   0   0   0   0   0   0   0   1   1   1   ... 0   1   1   1   0   0   0   1   0   1
3   0   0   0   0   0   0   0   0   0   0   ... 0   0   0   1   0   0   0   0   1   1
4   0   0   0   0   0   0   0   0   0   0   ... 0   0   0   1   0   0   0   0   0   1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
137 0   0   0   0   0   0   0   0   0   0   ... 0   0   0   0   0   1   0   0   0   0
138 1   1   0   0   0   0   0   0   0   1   ... 0   0   0   0   0   1   0   0   0   0
139 1   1   1   0   0   0   0   0   0   0   ... 0   0   0   0   0   1   0   0   0   0
140 1   1   0   0   0   0   0   0   0   1   ... 0   0   0   0   0   1   0   0   0   0
141 1   1   1   0   0   0   0   0   0   0   ... 0   0   0   0   0   1   0   0   0   0

I found another answer that provides this solution here: Kmeans Cluster for each group in pandas dataframe and assign clusters

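from sklearn.cluster import KMeans

def cluster(X):
    # Fit k-means on the rows of X, then derive one integer per row from its cluster
    k_means = KMeans(n_clusters=5).fit(X)
    return X.groupby(k_means.labels_)\
            .transform('mean').sum(1)\
            .rank(method='dense').sub(1)\
            .astype(int).to_frame()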

And the result I'm getting is:

    0
0   1
1   1
2   1
3   0
4   0
... ...
137 3
138 1
139 3
140 3
141 3

But to be fair, I don't entirely understand what it does, or whether the result I'm getting here is the cluster number for each row.


Solution

  • I'm not entirely sure what your example snippet does either, but for your use case something like this would work. First, get the cluster labels:

    from sklearn.cluster import KMeans
    
    df["cluster"] = KMeans(n_clusters=5).fit(df).labels_
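
If the line above might be run more than once, a slightly more defensive variant may help, since a second fit on the same frame would otherwise see the "cluster" column as a feature. This is only a sketch: the synthetic data stands in for the real frame, and `feature_cols`, `n_init=10` and `random_state=0` are my own illustrative choices, not something the question requires.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Synthetic stand-in for the real data: 142 rows x 538 binary columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(0, 2, size=(142, 538)))

# Remember the feature columns before adding anything to the frame, so a
# re-run never treats the "cluster" column as an input feature.
feature_cols = df.columns.tolist()

# Fixing random_state makes the labels repeatable between runs (assumed choice).
km = KMeans(n_clusters=5, n_init=10, random_state=0)
df["cluster"] = km.fit(df[feature_cols].to_numpy()).labels_
```

Passing a plain NumPy array to `fit` also sidesteps any complaints about mixed integer and string column names once "cluster" has been added.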
    

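    Then, if you need to do something with the indices of each cluster, you can get them as a dict with groupby("cluster").indices, for example. Note that these are integer row positions, which coincide with the index labels when the DataFrame has the default 0..n-1 index: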

    >>> df.groupby("cluster").indices
    {0: array([0, 1]), 1: array([2, 3]), ...}
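
For what it's worth, the function quoted in the question can be unpacked step by step. As far as I can tell from the sketch below, the column it returns is still one cluster id per row, just renumbered so that clusters are ordered by the sum of their column means. The toy data, the variable names and `random_state=0` are assumptions of mine. The sketch also shows `.groups`, which returns index labels rather than positions, in case the index is not the default one.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy binary data with a non-default index (hypothetical stand-in).
rng = np.random.default_rng(1)
X = pd.DataFrame(rng.integers(0, 2, size=(40, 12)),
                 index=[f"row{i}" for i in range(40)])

labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X.to_numpy()).labels_

# Step 1: replace every row by the column-wise mean of its cluster.
cluster_means = X.groupby(labels).transform("mean")
# Step 2: collapse each row to one number (the sum of its cluster's means).
score = cluster_means.sum(axis=1)
# Step 3: dense-rank those numbers and shift them so they start at 0.
relabelled = score.rank(method="dense").sub(1).astype(int)

# Rows in the same KMeans cluster always receive the same renumbered id.
ids_per_cluster = relabelled.groupby(labels).nunique()

# Index labels per cluster (.groups) versus integer positions (.indices).
by_label = pd.Series(labels, index=X.index, name="cluster")
groups = X.groupby(by_label).groups      # {cluster: Index of row labels}
positions = X.groupby(by_label).indices  # {cluster: array of row positions}
```

One caveat: if two clusters happened to have exactly the same sum of means, the ranking step would give them the same id, so using `labels_` directly is the safer way to get the cluster number.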