I have two vectors, each represented as a HashMap, and I want to measure the similarity between them. I use the cosine similarity metric, as in the following code:
public static void cosineSimilarity(HashMap<Integer,Double> vector1, HashMap<Integer,Double> vector2) {
    double scalar = 0.0d, v1Norm = 0.0d, v2Norm = 0.0d;
    // Both maps are expected to contain exactly the same keys.
    for (int featureId : vector1.keySet()) {
        double a = vector1.get(featureId);
        double b = vector2.get(featureId);
        scalar += a * b;   // dot product
        v1Norm += a * a;   // squared norm of vector1
        v2Norm += b * b;   // squared norm of vector2
    }
    v1Norm = Math.sqrt(v1Norm);
    v2Norm = Math.sqrt(v2Norm);
    double cosine = scalar / (v1Norm * v2Norm);
    System.out.println("v1 is: " + v1Norm + " , v2 is: " + v2Norm + " Cosine is: " + cosine);
}
Strangely, two vectors that are supposed to be dissimilar produce a result close to 0.9999, which is just wrong!
Please note that the keys are exactly the same for both maps.
The data file is here: file
File format:
FeatureId vector1_value vector2_value
Your code is fine.
The vectors are dominated by several large features. In those features, the two vectors are almost collinear, which is why the similarity measure is close to 1.
I include the six largest features below. Look at the ratio of vec2 over vec1: it's almost identical across those features.
feature     vec1     vec2       vec2/vec1
64806110    2875     1.85E+07   6.43E+03
64806108    5750     3.68E+07   6.40E+03
64806107    8625     5.49E+07   6.37E+03
64806106    11500    7.29E+07   6.34E+03
64806111    14375    9.07E+07   6.31E+03
64806109    17250    1.08E+08   6.28E+03
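To check this, here is a minimal sketch that runs the same cosine computation on just the six features listed above, using the rounded values from the table. The class name CosineDemo and the helper topSixCosine are made up for illustration, and the method returns the value instead of printing it. Since the vec2/vec1 ratio barely changes across these rows, the result should come out well above 0.99.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CosineDemo {

    // Cosine similarity over the keys of v1.
    // Assumes v2 contains exactly the same keys, as stated in the question.
    static double cosineSimilarity(Map<Integer, Double> v1, Map<Integer, Double> v2) {
        double dot = 0.0, norm1 = 0.0, norm2 = 0.0;
        for (Map.Entry<Integer, Double> e : v1.entrySet()) {
            double a = e.getValue();
            double b = v2.get(e.getKey());
            dot += a * b;
            norm1 += a * a;
            norm2 += b * b;
        }
        return dot / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }

    // Hypothetical helper: builds the two vectors from the six table rows
    // (rounded values, as printed in the table) and returns their cosine.
    static double topSixCosine() {
        int[] ids = {64806110, 64806108, 64806107, 64806106, 64806111, 64806109};
        double[] vec1 = {2875, 5750, 8625, 11500, 14375, 17250};
        double[] vec2 = {1.85e7, 3.68e7, 5.49e7, 7.29e7, 9.07e7, 1.08e8};

        Map<Integer, Double> m1 = new LinkedHashMap<>();
        Map<Integer, Double> m2 = new LinkedHashMap<>();
        for (int i = 0; i < ids.length; i++) {
            m1.put(ids[i], vec1[i]);
            m2.put(ids[i], vec2[i]);
        }
        return cosineSimilarity(m1, m2);
    }

    public static void main(String[] args) {
        System.out.println("Cosine over the six largest features: " + topSixCosine());
    }
}
```

Because these six features are orders of magnitude larger than the rest, they contribute almost all of the dot product and of both norms, so the cosine over the full vectors ends up nearly the same as the cosine over this subset.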