python pandas cluster-analysis k-means

K-means on a mixed-type dataframe


I have the following dataset and I want to apply clustering (in particular k-means) to it.

     id      category     value
0    122         A          3
1    122         B          4
2    122         C          9
3    145         A          19
4    145         B          22
5    145         C          90
.
.
. 
197    225         A          16 
198    225         B          17
199    225         C          12

What I want to do is create clusters of ids. Each cluster should contain the ids that are similar to one another, with similarity measured on their category values.

For example: C1 {122, 145, 148}, C2 {225, 222, 221}, ...

Any idea on how to deal with this kind of problem?


Solution

  • Pivot your data into the appropriate shape:

    Your categories should be columns, not separate rows.

         id          A          B         C
    1    122         3          4         9
    2    145         19         22        90
    ..
    

    Don't forget to exclude the ID column from the analysis! An ID is a label, not a feature, so never include it when clustering. The data you cluster should have only the columns A, B, C, with one row per ID. That gives you an n x 3 matrix, and you can use k-means on it just fine.
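
The pivot-then-cluster steps above could be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes pandas and scikit-learn are available, uses only the nine rows visible in the question as toy data, and picks `n_clusters=2` arbitrarily. The names `df`, `wide`, and `clusters` are illustrative.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy long-format data: only the rows shown in the question (ids 122, 145, 225).
df = pd.DataFrame({
    "id":       [122, 122, 122, 145, 145, 145, 225, 225, 225],
    "category": ["A", "B", "C"] * 3,
    "value":    [3, 4, 9, 19, 22, 90, 16, 17, 12],
})

# One row per id, one column per category. The id becomes the index,
# so it is kept out of the feature matrix.
wide = df.pivot(index="id", columns="category", values="value")

# n_clusters=2 is an assumption for this toy example; choose k for your data.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(wide.to_numpy())

# Attach each cluster label back to its id.
clusters = pd.Series(labels, index=wide.index, name="cluster")
print(clusters)
```

`pivot` requires each (id, category) pair to appear only once. If the real data has duplicates, `pivot_table` with an aggregation function such as `"mean"` is one possible substitute.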