algorithm, match, k-means

k-means algorithm for grouping similar users


I hope you are having a nice day. I have users in my database, and each user has features that may overlap with those of other users. For example:

user1 has features a, b, c, d, g
user2 has features a, b, c, e
user3 has features b, c, f
user4 has feature c
...

I want to write an algorithm that separates users into groups of 4, so that the users within each group have the most similar features. How can I use k-means for that? Or do I need to use another algorithm? Any ideas?


Solution

  • k-means might not be the best fit for this, but you can try it by converting your discrete variables (feature on/off) into continuous variables.

    Each feature then gets its own dimension, with a value of 1.0 (feature present) or 0.0 (feature missing). In your case, you appear to have at least 7 feature dimensions (a-g), so you would run k-means in 7-dimensional space.

    It might be a good idea to look into algorithms that are better adapted to your scenario, for example hierarchical clustering. There you can use the Manhattan distance instead of the Euclidean distance used by k-means. The Manhattan distance is a better fit for your n-dimensional grid feature space: on 0/1 vectors it simply counts the features on which two users differ. Try single-linkage and complete-linkage first. After the hierarchy is computed, you can cut the tree at the level that gives you the desired number of clusters.

    See https://scikit-learn.org/stable/modules/clustering.html#hierarchical-clustering for a good overview of this algorithm. It is simple to implement and try out if you are a bit familiar with Python.
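
To make the two steps above concrete (encode each feature as a 0.0/1.0 dimension, then cluster hierarchically with Manhattan distance), here is a minimal pure-Python sketch. It is not scikit-learn's implementation, only an illustration of the idea with complete-linkage. The feature sets are taken from the question; the names `users`, `vectors`, `manhattan`, `complete_linkage` and `agglomerate`, and the choice of 2 clusters, are assumptions made for this example. Note that it stops at a given number of clusters and does not force every group to contain exactly 4 users.

```python
from itertools import combinations

# Feature sets copied from the question; all names below are illustrative.
users = {
    "user1": {"a", "b", "c", "d", "g"},
    "user2": {"a", "b", "c", "e"},
    "user3": {"b", "c", "f"},
    "user4": {"c"},
}

# One dimension per feature: 1.0 if the user has it, 0.0 otherwise.
features = sorted(set().union(*users.values()))
vectors = {
    name: [1.0 if f in feats else 0.0 for f in features]
    for name, feats in users.items()
}


def manhattan(u, v):
    """Sum of absolute differences; on 0/1 vectors, the number of differing features."""
    return sum(abs(x - y) for x, y in zip(u, v))


def complete_linkage(c1, c2):
    """Cluster distance = largest pairwise distance between members."""
    return max(manhattan(vectors[a], vectors[b]) for a in c1 for b in c2)


def agglomerate(names, n_clusters):
    """Repeatedly merge the two closest clusters until n_clusters remain."""
    clusters = [[n] for n in names]
    while len(clusters) > n_clusters:
        # Ties are broken by taking the first closest pair found.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: complete_linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


groups = agglomerate(list(users), n_clusters=2)
print(groups)
```

With the four example users, user3 and user4 differ in only two features (b and f), so they are merged first. The next merge is a tie at distance 3, and the first-found rule joins user1 with user2. Swapping `max` for `min` in `complete_linkage` gives single-linkage, which is an easy way to compare the two linkages suggested above.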