I have the following dataset and I want to apply clustering (in particular k-means) to it.
id category value
0 122 A 3
1 122 B 4
2 122 C 9
3 145 A 19
4 145 B 22
5 145 C 90
.
.
.
197 225 A 16
198 225 B 17
199 225 C 12
What I want to do is cluster the ids: each cluster should contain the ids that are similar to each other according to a similarity measure computed on their category values.
For example: C1 = {122, 145, 148}, C2 = {225, 222, 221}, ...
Any ideas on how to approach this kind of problem?
Pivot your data into the appropriate shape: your categories should be columns, not separate rows.
id A B C
1 122 3 4 9
2 145 19 22 90
..
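As a rough sketch of that pivot, assuming the data lives in a pandas DataFrame with the column names shown in the question (`long_df` and `wide_df` are just illustrative names, and only three of the ids are reproduced here):

```python
import pandas as pd

# Toy long-format data using the sample rows from the question.
long_df = pd.DataFrame({
    "id": [122, 122, 122, 145, 145, 145, 225, 225, 225],
    "category": ["A", "B", "C"] * 3,
    "value": [3, 4, 9, 19, 22, 90, 16, 17, 12],
})

# Reshape to one row per id and one column per category.
wide_df = long_df.pivot(index="id", columns="category", values="value")
print(wide_df)
```

Note that `pivot` raises an error if the same (id, category) pair occurs more than once; in that case `pivot_table` with an aggregation function of your choice may be the better fit.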
Don't forget to exclude the ID column from the analysis! Never include IDs when clustering: they are arbitrary labels, not features, and would distort the distances. For the analysis, your data should contain only the columns A, B, and C, with one row per ID. That gives you an n x 3 matrix, on which k-means works just fine.
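A minimal sketch of the clustering step, assuming scikit-learn is available. The choice of k = 2, the variable names, and the three-row toy table are assumptions for illustration only; on the real data you would pick k yourself and probably consider scaling the columns first.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Wide table: one row per id, columns A, B, C (sample values from the question).
wide_df = pd.DataFrame(
    {"A": [3, 19, 16], "B": [4, 22, 17], "C": [9, 90, 12]},
    index=pd.Index([122, 145, 225], name="id"),
)

# Feature matrix of shape n x 3. The id is kept only as the index, not as a feature.
X = wide_df[["A", "B", "C"]].to_numpy(dtype=float)

# k = 2 is an arbitrary choice for this toy example.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)

# Map each cluster label back to the ids it contains.
clusters = {}
for id_, label in zip(wide_df.index, labels):
    clusters.setdefault(int(label), []).append(int(id_))
print(clusters)
```

The ids are used only afterwards, to report which id landed in which cluster.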