I am working on a clustering problem for social network profiles, where each profile document is represented by the number of times each term of interest occurs in the profile description. To cluster effectively, I am trying to find the right similarity measure (or distance function) between two profiles.
So let's say I have the following table of profiles:
           basketball  cricket  python
profile1        4          2        1
profile2        2          1        3
profile3        2          1        0
Now, going by Euclidean distance, I get
distance (profile1,profile2) = 3
distance (profile2,profile3) = 3
distance (profile3,profile1) = 2.45
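(For reference, these distances can be reproduced with NumPy/SciPy along these lines; just a sketch, the exact code is not important.)

import numpy as np
from scipy.spatial.distance import pdist, squareform

# rows = profiles, columns = basketball, cricket, python
profiles = np.array([[4, 2, 1],
                     [2, 1, 3],
                     [2, 1, 0]])

# pairwise Euclidean distances between all profiles
dist = squareform(pdist(profiles, metric='euclidean'))
print(dist[0, 1])   # profile1 - profile2 -> 3.0
print(dist[1, 2])   # profile2 - profile3 -> 3.0
print(dist[0, 2])   # profile1 - profile3 -> 2.449...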
Now, this is fine, but two questions come to my mind:
Here we are disregarding the number of features the profiles have in common. For example, even though profile 1 and profile 3 are nearest by this measure, human intuition says that profile 1 and profile 2 should be more similar, since both have at least some value for all three interests (basketball, cricket and python), whereas profile 3 does not mention python at all. At the same time, I don't want the distance to be just a count of shared features, which would surely give wrong results.
My first question - is there any established way to accommodate this intuition?
My second question - some profile authors are more verbose than others; how do I adjust for this? A verbose author whose profile has 4 occurrences of python may really be the same as a less verbose author with 2 occurrences of python.
I was not able to come up with a good title for the question, so sorry if it's confusing.
First, compute your profiles as you already did. Then the crucial step will be some kind of normalization. You can either divide the numbers by their total so that the numbers sum up to 1, or you can divide them by their Euclidean norm so that they have Euclidean norm 1.
For instance, using sum normalization, the first profile would become (rounded)
0.57, 0.29, 0.14
and using Euclidean normalization, it would become
0.87, 0.44, 0.22
This will make sure that all profiles are represented in the same numeric range, and will take care of the "overly verbose profile author".
Below is an example IPython session that shows how to normalize the rows by row sum and how to compute the Euclidean distances between the normalized rows. You will see that after normalization, profiles 1 and 3 are much closer together, as you would expect.
In [20]: from numpy import array, newaxis, sqrt
In [21]: from numpy.linalg import norm
In [22]: p = array([[4,2,1],[2,1,3],[2,1,0]])
In [23]: p
Out[23]:
array([[4, 2, 1],
       [2, 1, 3],
       [2, 1, 0]])
In [24]: p = p / p.sum(axis=1)[:,newaxis]
In [25]: p
Out[25]:
array([[ 0.57142857,  0.28571429,  0.14285714],
       [ 0.33333333,  0.16666667,  0.5       ],
       [ 0.66666667,  0.33333333,  0.        ]])
In [26]: p.sum(axis=1)
Out[26]: array([ 1., 1., 1.])
In [27]: norm(p[0] - p[1]) # distance 1-2
Out[27]: 0.44543540318737401
In [28]: norm(p[0] - p[2]) # distance 1-3
Out[28]: 0.17817416127494959
In [29]: norm(p[1] - p[2]) # distance 2-3
Out[29]: 0.62360956446232352
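If you already use scikit-learn, its preprocessing.normalize helper does the same row scaling in one call (norm='l1' for the row-sum version shown above, norm='l2' for the Euclidean-norm version). A small sketch, assuming scikit-learn and SciPy are installed:

import numpy as np
from sklearn.preprocessing import normalize
from scipy.spatial.distance import pdist

p = np.array([[4, 2, 1], [2, 1, 3], [2, 1, 0]], dtype=float)

p_l1 = normalize(p, norm='l1')   # divide each row by its sum
p_l2 = normalize(p, norm='l2')   # divide each row by its Euclidean norm

print(pdist(p_l1))   # the same three distances as in the session above
print(pdist(p_l2))   # distances under Euclidean-norm normalization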
Finally, if you want to put more weight on whether a profile mentions an interest at all and not so much on how often it mentions it, you can do an extra step before normalization: simply compute pow(x, alpha) for every element x of your profile vectors, where alpha is a parameter between 0 and 1. Here, alpha = 1 means standard linear weighting as before, and as you make alpha close to 0, only mentioning the interest counts, not how often. For instance, with alpha = 0.5 (taking the square root of the profiles), we get:
In [32]: p = array([[4,2,1],[2,1,3],[2,1,0]])
In [33]: p = sqrt(p)
In [34]: p
Out[34]:
array([[ 2.        ,  1.41421356,  1.        ],
       [ 1.41421356,  1.        ,  1.73205081],
       [ 1.41421356,  1.        ,  0.        ]])
In [35]: p = p / p.sum(axis=1)[:,newaxis]
In [37]: norm(p[0] - p[1]) # distance 1-2
Out[37]: 0.2353133053319465
In [38]: norm(p[0] - p[2]) # distance 1-3
Out[38]: 0.27881275777438091
In [39]: norm(p[1] - p[2]) # distance 2-3
Out[39]: 0.51412606310632747
Now profiles 1 and 2 are the closest match since we put more emphasis on the fact that they both mention Python, not so much on how often they mention it.
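If you want to reuse this, one way to wrap the whole recipe (power weighting, row normalization, pairwise Euclidean distances) into a small helper could look like the sketch below; the function name and defaults are just a suggestion, not an established API.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def profile_distances(counts, alpha=0.5, norm='l1'):
    """Pairwise Euclidean distances between term-count profiles.

    counts : (n_profiles, n_terms) array of term counts
    alpha  : exponent in (0, 1]; smaller values emphasize *whether*
             a term is mentioned over *how often* it is mentioned
    norm   : 'l1' divides each row by its sum, 'l2' by its Euclidean norm
    """
    p = np.asarray(counts, dtype=float) ** alpha
    if norm == 'l1':
        p /= p.sum(axis=1, keepdims=True)
    else:
        p /= np.linalg.norm(p, axis=1, keepdims=True)
    return squareform(pdist(p))

# with alpha=0.5 and sum normalization this reproduces the last session
d = profile_distances([[4, 2, 1], [2, 1, 3], [2, 1, 0]], alpha=0.5)
print(d[0, 1], d[0, 2], d[1, 2])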