I am implementing cosine similarity in PHP. Sometimes the formula gives a result above 1. To derive an angle from this number using inverse cosine, it needs to be between -1 and 1 (between 0 and 1 when all vector components are non-negative, as mine are).
I know that I don't need the angle itself, since the closer the value is to 1, the more similar the vectors are, and the closer it is to 0, the less similar.
However, I don't know what to make of a number above 1. Does it just mean the vectors are totally dissimilar? Is 2 less similar than 0?
Could you say that the order of similarity goes roughly like this:
From 0 up to 1, the vectors become more and more similar. Above 1, they become less and less similar the further the value gets from 1.
Thank you!
My code, as requested, is:
$norm1 = 0;
foreach ($dict1 as $value) {
    $valuesq = $value * $value;
    $norm1 = $norm1 + $valuesq;
}
$norm1 = sqrt($norm1);
$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
$cospheta = ($dot_product)/($norm1*$norm2);
To give you an idea of the kinds of values I'm getting:
0.9076645291077
2.0680991116095
1.4015600717928
1.0377360186767
1.8563586243689
1.0349674872379
1.2083865384822
2.3000034036913
0.84280491429133
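Your formula is correct, but a cosine above 1 is mathematically impossible, because the dot product can never exceed the product of the two norms. So I suspect something is going wrong when you calculate the norms; the code you posted does not show how $norm2 is computed. It works fine if you move the norm calculation into its own function, as follows: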
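<?php
// Euclidean norm: square root of the sum of squared components.
function calc_norm($arr) {
    $norm = 0;
    foreach ($arr as $value) {
        $norm += $value * $value;
    }
    return sqrt($norm);
}

$dict1 = array(5, 0, 97);
$dict2 = array(300, 2, 124);

// array_map pairs elements by position, so both arrays must list terms in the same order.
// bcmul needs the BCMath extension; with the default scale of 0 it truncates
// fractional products, so it only suits integer values like these.
$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
$cospheta = $dot_product / (calc_norm($dict1) * calc_norm($dict2));
print_r($cospheta);
?>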
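If your real arrays are keyed by term rather than by position, or hold fractional weights, a variant along the following lines may be safer. This is only a sketch under assumptions: the function names vector_norm and cosine_similarity are made up for illustration, I am guessing that your dictionaries map terms to numeric weights, and returning 0 for an all-zero vector is an arbitrary choice. It multiplies with plain PHP arithmetic instead of bcmul, matches components by key, and clamps the result so that floating-point rounding can never push it outside the range that acos accepts.

```php
<?php
// Hypothetical helpers for illustration; the names are not from the original post.

// Euclidean norm of a numeric array.
function vector_norm(array $v): float {
    $sum = 0.0;
    foreach ($v as $x) {
        $sum += $x * $x;
    }
    return sqrt($sum);
}

// Cosine similarity of two arrays, matching components by key.
// Keys present in only one array contribute nothing to the dot product.
function cosine_similarity(array $a, array $b): float {
    $dot = 0.0;
    foreach ($a as $key => $x) {
        if (isset($b[$key])) {
            $dot += $x * $b[$key];
        }
    }
    $denom = vector_norm($a) * vector_norm($b);
    if ($denom == 0.0) {
        return 0.0; // assumption: treat an all-zero vector as dissimilar
    }
    // Clamp to [-1, 1] to absorb rounding error before calling acos().
    return max(-1.0, min(1.0, $dot / $denom));
}

$dict1 = array(5, 0, 97);
$dict2 = array(300, 2, 124);

$cos = cosine_similarity($dict1, $dict2);
echo $cos, "\n";                 // the cosine itself
echo rad2deg(acos($cos)), "\n";  // the corresponding angle in degrees
```

Note that the clamp only hides tiny rounding errors. Values like 2.07 or 2.30 point to a real bug (a wrong norm or misaligned arrays), and that should be fixed rather than clamped away.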