How can I calculate similarity between user and score?
For example, df:
user  score  category_cluster
i     4.5    category1
j     5      category1
k     9.5    category2
I want a result like: the similarity between users i and j, based on score, within the same category_cluster. If two users are not in the same cluster, do not compute the similarity.
How would you measure the similarity?
You will need to define a score function first. Among others, you have the Manhattan and Euclidean distances, which are probably the most used ones. For more information about distances, I suggest looking into scikit-learn, which has a wide variety of distances (metrics) implemented. Look here for a list (you can research later what each of them measures).
Some of them are distance metrics (how different the elements are; the closer to 0, the more similar), while others measure similarity (like exponential kernels; the closer to 1, the more similar). It is easy to swap between distance and similarity metrics; the most basic conversion is distance = 1. - similarity, assuming both are in the [0,1] range.
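As a rough illustration of that swap, here is a minimal sketch using plain NumPy on the scores from the question's table. The Manhattan distance and the division by the largest distance (to bring values into [0,1]) are assumptions made for this example, not the only way to normalize.

```python
import numpy as np

# Scores of users i, j, k from the example table in the question.
scores = np.array([4.5, 5.0, 9.5])

# Pairwise Manhattan distance; for 1-D data this is simply |a - b|.
distance = np.abs(scores[:, None] - scores[None, :])

# Scale into [0, 1] by dividing by the largest distance
# (an assumed normalization chosen for this sketch).
distance01 = distance / distance.max()

# Swap from a distance to a similarity: 0 distance -> similarity 1.
similarity = 1.0 - distance01

print(similarity)
```

Each user ends up with similarity 1 to itself, and the two users whose scores are farthest apart get similarity 0.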
As for your similarity example, similarity[i,j] = 0.9 doesn't make any sense to me. What would be the similarity of i and k? Which formula did you use to get that 0.9? If you clarify it, I could provide you with a numpy-based representation.
For direct similarity metrics, have a look here. You can use any of them if it suits your needs; the page explains what each of them measures.
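An example usage of rbf_kernel:
from sklearn.metrics.pairwise import rbf_kernel
data = df['score'].to_numpy()  # convert to a NumPy array so it can be reshaped
similarity = rbf_kernel(data.reshape(-1, 1), gamma=1.) # Try different values of gamma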
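gamma here acts like a threshold: different values of gamma make it easier or harder for two scores to count as similar. The kernel is exp(-gamma * (x - y)^2), so a larger gamma makes the similarity fall off faster as the scores move apart.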
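To restrict the comparison to users in the same category_cluster, one option is to group the rows first and build one similarity matrix per cluster. The sketch below is only an illustration under some assumptions: the helper names rbf_similarity and similarity_by_cluster are hypothetical, the DataFrame is rebuilt from the question's table, and the RBF formula is written out with NumPy (the same exp(-gamma * squared distance) form that rbf_kernel computes) so the example does not depend on scikit-learn.

```python
import numpy as np
import pandas as pd

# Example data rebuilt from the table in the question.
df = pd.DataFrame({
    "user": ["i", "j", "k"],
    "score": [4.5, 5.0, 9.5],
    "category_cluster": ["category1", "category1", "category2"],
})


def rbf_similarity(scores, gamma):
    """Hypothetical helper: exp(-gamma * (a - b)**2) for every pair of scores."""
    x = np.asarray(scores, dtype=float)
    return np.exp(-gamma * (x[:, None] - x[None, :]) ** 2)


def similarity_by_cluster(frame, gamma=1.0):
    """Hypothetical helper: {cluster: user-by-user similarity DataFrame}.

    Users in different clusters never appear in the same matrix,
    so no cross-cluster similarity is computed.
    """
    result = {}
    for cluster, group in frame.groupby("category_cluster"):
        users = group["user"].to_numpy()
        sim = rbf_similarity(group["score"].to_numpy(), gamma)
        result[cluster] = pd.DataFrame(sim, index=users, columns=users)
    return result


# A small gamma is lenient, a large gamma is strict.
loose = similarity_by_cluster(df, gamma=0.1)
strict = similarity_by_cluster(df, gamma=10.0)

print(loose["category1"])
print(strict["category1"])
```

With these two settings, users i and j look far more alike under the small gamma than under the large one, which is the sense in which gamma works like a threshold.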