java, text-mining, cosine-similarity

Cosine similarity returning wrong distance


I have two vectors, each represented as a HashMap from feature ID to value, and I want to measure the similarity between them. I use the cosine similarity metric, as in the following code:

public static void cosineSimilarity(HashMap<Integer,Double> vector1, HashMap<Integer,Double> vector2){
    double scalar=0.0d, v1Norm=0.0d, v2Norm=0.0d;

    // Assumes both maps contain exactly the same keys.
    for(int featureId: vector1.keySet()){
        double a = vector1.get(featureId);
        double b = vector2.get(featureId);
        scalar += a * b;   // dot product
        v1Norm += a * a;   // squared norm of vector1
        v2Norm += b * b;   // squared norm of vector2
    }

    v1Norm=Math.sqrt(v1Norm);
    v2Norm=Math.sqrt(v2Norm);

    double cosine= scalar / (v1Norm*v2Norm);
    System.out.println("v1 is: "+v1Norm+" , v2 is: "+v2Norm+" Cosine is: "+cosine);
}

Strangely, two vectors that are supposed to be dissimilar produce a cosine close to 0.9999, which looks wrong.

Please note that the keys are exactly the same for both maps.

data file is here: file

File format:

FeatureId vector1_value vector2_value


Solution

  • Your code is fine.

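    The vectors are dominated by a few very large features. Across those features the two vectors are almost collinear: vec2 is roughly a constant multiple of vec1. Cosine similarity measures only the angle between the vectors and ignores their magnitudes, which is why the result is close to 1.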

    I include the six largest features below. Look at the ratio of vec2 over vec1: it's almost identical across those features.

    feature     vec1     vec2        vec2/vec1
    64806110    2875     1.85E+07    6.43E+03
    64806108    5750     3.68E+07    6.40E+03
    64806107    8625     5.49E+07    6.37E+03
    64806106    11500    7.29E+07    6.34E+03
    64806111    14375    9.07E+07    6.31E+03
    64806109    17250    1.08E+08    6.28E+03
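As a rough check of the point above, here is a minimal sketch that computes the cosine over just the six features in the table. The class name `CosineDemo` and the helpers `cosine`, `vec1`, and `vec2` are made up for illustration, and the vec2 values are the rounded figures from the table, not the full data file. It also compares vec1 against a scaled copy of itself to show that scaling alone does not change the cosine.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class CosineDemo {

    // Feature IDs and values copied from the table above (vec2 values are rounded).
    static final int[] IDS = {64806110, 64806108, 64806107, 64806106, 64806111, 64806109};
    static final double[] V1 = {2875, 5750, 8625, 11500, 14375, 17250};
    static final double[] V2 = {1.85e7, 3.68e7, 5.49e7, 7.29e7, 9.07e7, 1.08e8};

    // Builds a feature map from the shared IDs, multiplying each value by k.
    static Map<Integer, Double> toMap(double[] values, double k) {
        Map<Integer, Double> m = new LinkedHashMap<>();
        for (int i = 0; i < IDS.length; i++) {
            m.put(IDS[i], values[i] * k);
        }
        return m;
    }

    static Map<Integer, Double> vec1() { return toMap(V1, 1.0); }
    static Map<Integer, Double> vec2() { return toMap(V2, 1.0); }

    // Same computation as in the question, but returning the value.
    // Assumes v2 contains every key of v1, as the question states.
    static double cosine(Map<Integer, Double> v1, Map<Integer, Double> v2) {
        double dot = 0.0, n1 = 0.0, n2 = 0.0;
        for (Map.Entry<Integer, Double> e : v1.entrySet()) {
            double a = e.getValue();
            double b = v2.get(e.getKey());
            dot += a * b;
            n1 += a * a;
            n2 += b * b;
        }
        return dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }

    public static void main(String[] args) {
        // Very close to 1, although vec2 is thousands of times larger than vec1.
        System.out.println("cosine(vec1, vec2)        = " + cosine(vec1(), vec2()));
        // A scaled copy of vec1 gives a cosine of 1, up to floating-point rounding.
        System.out.println("cosine(vec1, 6400 * vec1) = " + cosine(vec1(), toMap(V1, 6400.0)));
    }
}
```

If the magnitude difference matters for your notion of "dissimilar", cosine similarity on the raw values will not capture it, because it only compares direction.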