Tags: algorithm, similarity, euclidean-distance, cosine-similarity

Calculating similarity based on attributes


My objective is to calculate the degree of similarity between two users based on their attributes. For instance, consider players with the attributes age, salary, and points.

I also want to weight each attribute by its importance. In my case, age is more important than salary and points. For instance, let's assume we calculate the similarity using the Euclidean distance.

Given user 1 who is age 20, salary 50, points scored 100

Given user 2 who is age 24, salary 60, points scored 85

Given user 3 who is age 19, salary 62, points scored 80

To compute the similarity between user 1 and user 2 I could do

sqrt( (20-24)^2 + (50-60)^2 + (100-85)^2 )

Now we also want to add the weights. With Euclidean distance, the lower the number, the closer two objects are in terms of similarity. As mentioned earlier, age is the most important attribute, so we assign the weights as follows:

sqrt( 0.60*(20-24)^2 + 0.20*(50-60)^2 + 0.20*(100-85)^2 )

Is my approach correct? Also, should I be considering other algorithms, such as cosine similarity, to calculate similarity?


Solution

  • I am currently working on a project that involves calculating measurements between different entities, so I am familiar with your problem.

    In your case, the good thing is that you don't have features of various mixed types (e.g. text or categorical). Age, salary, and points are all numbers and, as already mentioned in the comments, the first thing you should do is normalization. It's a must: if you skip it, there is a danger that the feature with the largest numeric range will dominate the distance calculation.

    You have to be careful: check your data and clean it if necessary. For example, a bad value such as an age of 200 will distort your normalization, and the majority of the scaled age values will end up in the lower part of the range (closer to zero).
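
To illustrate the normalization point, here is a minimal sketch assuming min-max scaling to [0, 1] (one common choice, and not the only one). The helper name `min_max_scale` is made up for this example; the ages come from the question, plus the bad value 200 mentioned above.

```python
def min_max_scale(values):
    """Scale a list of numbers linearly into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: no spread to scale, so map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Ages of the three users from the question.
ages = [20, 24, 19]
scaled_ages = min_max_scale(ages)

# Same ages plus one bad record with age 200.
ages_with_outlier = [20, 24, 19, 200]
scaled_with_outlier = min_max_scale(ages_with_outlier)

print(scaled_ages)
print(scaled_with_outlier)
```

With the bad record included, the three real ages all land below 0.03, so age differences between real users almost vanish after scaling. That is why the data should be cleaned before normalizing.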

    You are right regarding the weights and calculating the weighted Euclidean distance. The weights sum to 1 (as you showed in your example, 0.6 + 0.2 + 0.2 = 1).
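
As a rough sketch, the weighted Euclidean distance from the question could be written like this in plain Python. The function name `weighted_euclidean` is just an illustration, and the values are deliberately the raw, unnormalized numbers from the question so the result matches the formula there.

```python
import math


def weighted_euclidean(a, b, weights):
    """Weighted Euclidean distance: sqrt(sum(w_i * (a_i - b_i)^2))."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# (age, salary, points) for the users in the question, not normalized.
user1 = (20, 50, 100)
user2 = (24, 60, 85)
user3 = (19, 62, 80)

# Age weighted highest, as in the question; the weights sum to 1.
weights = (0.6, 0.2, 0.2)

d12 = weighted_euclidean(user1, user2, weights)  # sqrt(74.6), about 8.64
d13 = weighted_euclidean(user1, user3, weights)  # sqrt(109.4), about 10.46

print(round(d12, 3), round(d13, 3))
```

Note that for user 1 vs. user 2 the age term contributes only 9.6 of the 74.6 under the square root, even with a weight of 0.6. On raw values the salary and points differences still dominate, which is another reason to normalize before applying the weights.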

    Which distance metric to use is a good question. There are a bunch of them; see, e.g., https://docs.scipy.org/doc/scipy/reference/spatial.distance.html

    But based on my experience I would choose Euclidean, although you should try a few and check how they work on your data.
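
One way to "try a few" is sketched below, assuming three hand-written metrics (Euclidean, Manhattan, and cosine distance) in plain Python so nothing beyond the standard library is needed. The function names and the `users` dictionary are illustrative, and the values are the raw ones from the question; on real data you would normalize first.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity; depends only on the angle between the vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# (age, salary, points) from the question.
users = {
    "user1": (20, 50, 100),
    "user2": (24, 60, 85),
    "user3": (19, 62, 80),
}
metrics = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "cosine": cosine_distance,
}

# For each metric, find the user closest to user1.
query = "user1"
others = [name for name in users if name != query]
nearest = {}
for metric_name, metric in metrics.items():
    nearest[metric_name] = min(others, key=lambda u: metric(users[query], users[u]))
    print(metric_name, "->", nearest[metric_name])
```

On these three users all three metrics pick user 2 as the closest to user 1, but that will not hold in general. Cosine distance in particular ignores magnitude: a user whose attributes are all exactly double another's has a cosine distance of 0 to that user, which may or may not be what you want.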