Search code examples
pythonk-meansknnhierarchical-clustering

KMeans Clustering using Python


I have a dataset

Name    System
A       AZ
A       NaN
B       AZ
B       NaN
B       NaN
C       AY
C       AY
D       AZ
E       AY
E       AY
E       NaN
F       AZ
F       AZ
F       NaN

Using this dataset, I need to cluster the dataset based on the number of times "System" is repeated for a particular "Name".

In the above example, Names A, B and D have one "AZ" "Subset" while C, E have two "AY" subsets and F has two AZ so it is a different cluster.

Output Example:

Cluster     Names
AZ          A,B
AY,AY       C,E
AZ,AZ       F 

PS. Actual dataset may vary in number of rows and columns How can I do it using ML based clustering algorithms like KNN, Naive Bayes, etc? I need two approaches, one without ignoring NaN, one ignoring NaN.


Solution

  • IIUC, this looks like a double groupby. You first need to group by Name and System and aggregate System to form the clusters. Then this is a simple groupby with aggregation as string.

    (df.groupby(['Name', 'System'])
       ['System'].agg(Cluster=','.join)          # clusters of repeats
       .droplevel('System').reset_index()
       .groupby('Cluster')['Name'].agg(','.join) # aggregate by cluster
       .reset_index()
    )
    

    output:

      Cluster   Name
    0   AY,AY    C,E
    1      AZ  A,B,D
    2   AZ,AZ      F
    

    NB. I used aggregation with ','.join here but you could also use tuple or a custom function (first groupby) and frozenset (second groupby) to keep access to the individual elements

    (df.groupby(['Name', 'System'], dropna=False)
       ['System'].agg(Cluster=lambda x: (x.iloc[0], len(x)))
       .droplevel('System').reset_index()
       .groupby('Cluster')['Name'].agg(frozenset)
       .reset_index()
    )
    

    output:

        Cluster       Name
    0   (AY, 2)     (E, C)
    1   (AZ, 1)  (B, A, D)
    2   (AZ, 2)        (F)
    3  (nan, 1)  (E, F, A)
    4  (nan, 2)        (B)