Tags: python, python-2.7, distance, similarity, cosine-similarity

Python: finding score similarity between users within a cluster


How can I calculate the similarity between users based on their scores?

For example, df:

    user    score   category_cluster
    i       4.5     category1
    j       5       category1
    k       9.5     category2

I want a result like this: the similarity between the scores of users i and j when they are in the same category_cluster. If two users are not in the same cluster, the similarity should not be computed. How would you measure the similarity?


Solution

  • You will need to define a score function first. Among others, you have the Manhattan and Euclidean distances, which are probably the most widely used. For more information about distances, I suggest looking into scikit-learn, which implements a wide variety of distances (metrics). Look here for a list (you can research later what each of them measures).

    Some of them are distance metrics (how different the elements are: the closer to 0, the more similar), while others measure similarity (like exponential kernels: the closer to 1, the more similar). It is easy to swap between distance and similarity metrics (the most basic conversion being distance = 1. - similarity, assuming both are in the [0,1] range).

    As for your example, similarity[i,j] = 0.9 doesn't make any sense to me. What would be the similarity of i and k? Which formula did you use to get that 0.9? If you clarify it, I could provide you with a NumPy-based implementation.

    For direct similarity metrics, have a look here. You can use any of them if they suit your needs. The page explains what each of them measures.


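    An example usage of rbf_kernel:

    from sklearn.metrics.pairwise import rbf_kernel
    data = df['score'].values.reshape(-1, 1)  # 2-D column of scores, as scikit-learn expects
    similarity = rbf_kernel(data, gamma=1.)  # Try different values of gamma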
    

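    gamma here acts like a threshold: rbf_kernel computes exp(-gamma * ||x - y||^2), so larger values of gamma make the similarity drop off faster as the scores move apart, while smaller values make it easier for two scores to count as similar.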
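The question also asks that similarity only be computed between users in the same category_cluster. A minimal sketch of one way to do that is to group the frame by cluster and apply rbf_kernel to each group separately. The helper name within_cluster_similarity and the choice of gamma=1.0 are assumptions for illustration, and the DataFrame just reproduces the example from the question.

```python
import pandas as pd
from sklearn.metrics.pairwise import rbf_kernel

# Toy data copied from the question.
df = pd.DataFrame({
    "user": ["i", "j", "k"],
    "score": [4.5, 5.0, 9.5],
    "category_cluster": ["category1", "category1", "category2"],
})


def within_cluster_similarity(frame, gamma=1.0):
    """Hypothetical helper: map each cluster to a user-by-user similarity table."""
    result = {}
    for cluster, group in frame.groupby("category_cluster"):
        # scikit-learn expects a 2-D array: one row per user, one column (the score).
        scores = group["score"].values.reshape(-1, 1)
        # Pairwise RBF similarity: exp(-gamma * (score_a - score_b) ** 2).
        sim = rbf_kernel(scores, gamma=gamma)
        users = group["user"].values
        result[cluster] = pd.DataFrame(sim, index=users, columns=users)
    return result


similarities = within_cluster_similarity(df, gamma=1.0)
sim_ij = similarities["category1"].loc["i", "j"]  # i and j share a cluster
```

Users in different clusters never appear in the same table, so a pair such as (i, k) is simply never computed. If another metric fits better, the rbf_kernel call could be swapped for a different pairwise function from scikit-learn without changing the grouping logic.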