php, information-retrieval, cosine-similarity

Cosine similarity result above one


I am coding cosine similarity in PHP. Sometimes the formula gives a result above one. To derive an angle from this number using inverse cosine, it needs to be between -1 and 1 (between 0 and 1 for vectors with no negative components, as in my case).

I know that I don't strictly need the angle: the closer the value is to 1, the more similar the two vectors are, and the closer it is to 0, the less similar.

However, I don't know what to make of a number above 1. Does it just mean it is totally dissimilar? Is 2 less similar than 0?

Could you say that the order of similarity kind of goes:

From 0 up to 1: more and more similar as the value approaches 1. Above 1: less and less similar the further the value gets from 1.

Thank you!

My code, as requested, is:

$norm1 = 0;
foreach ($dict1 as $value) {
    $valuesq = $value * $value;
    $norm1 = $norm1 + $valuesq;
}
$norm1 = sqrt($norm1);
$dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
$cospheta = ($dot_product)/($norm1*$norm2);

To give you an idea of the kinds of values I'm getting:

0.9076645291077

2.0680991116095

1.4015600717928

1.0377360186767

1.8563586243689

1.0349674872379

1.2083865384822

2.3000034036913

0.84280491429133 

Solution

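  • Your formula is correct, but something is probably going wrong when you calculate the norms (your snippet never shows how $norm2 is computed). By the Cauchy-Schwarz inequality, the dot product can never exceed the product of the two norms, so a value above 1 points to a bug rather than to any degree of dissimilarity. It works if you move the norm calculation into its own function and call it for both vectors, as follows: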

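    <?php
    // Euclidean (L2) norm: square root of the sum of squared components.
    function calc_norm($arr) {
        $norm = 0;
        foreach ($arr as $value) {
            $norm += $value * $value;
        }
        return sqrt($norm);
    }

    $dict1 = array(5, 0, 97);
    $dict2 = array(300, 2, 124);

    // bcmul() needs the bcmath extension; with the default scale of 0 it
    // drops any fractional part, so it suits integer counts only.
    $dot_product = array_sum(array_map('bcmul', $dict1, $dict2));
    // Use the same norm function for both vectors.
    $cospheta = $dot_product / (calc_norm($dict1) * calc_norm($dict2));

    print_r($cospheta);
    ?>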
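
If your vectors are really term => weight dictionaries (an assumption on my part, based on the names $dict1 and $dict2), note that array_map() pairs elements by position, not by key. The following is only a sketch of one way to handle that: cosine_similarity(), the sample terms and the choice to return 0 for an all-zero vector are hypothetical and not taken from your code. It matches entries by key, uses plain multiplication instead of bcmul(), and clamps the result so that rounding error cannot push it outside the range acos() accepts.

```php
<?php
// Hypothetical helper (not from the original post): cosine similarity for two
// term => weight arrays, matching entries by key rather than by position.
function cosine_similarity(array $a, array $b): float
{
    $dot = 0.0;
    foreach ($a as $term => $weight) {
        // Terms missing from $b contribute nothing to the dot product.
        if (isset($b[$term])) {
            $dot += $weight * $b[$term];
        }
    }

    // One norm routine for both vectors, so the two cannot be computed differently.
    $norm = function (array $v): float {
        $sum = 0.0;
        foreach ($v as $weight) {
            $sum += $weight * $weight;
        }
        return sqrt($sum);
    };

    $denominator = $norm($a) * $norm($b);
    if ($denominator == 0.0) {
        return 0.0; // Assumption: treat an all-zero vector as "not similar".
    }

    // Clamp to [-1, 1] so floating-point rounding cannot break acos().
    return max(-1.0, min(1.0, $dot / $denominator));
}

// Made-up sample data reusing the numbers from the example above.
$doc1 = ['cat' => 5, 'dog' => 0, 'fish' => 97];
$doc2 = ['cat' => 300, 'dog' => 2, 'fish' => 124];

$cosine = cosine_similarity($doc1, $doc2);
$angle  = rad2deg(acos($cosine)); // angle between the vectors, in degrees

printf("cosine = %.4f, angle = %.1f degrees\n", $cosine, $angle);
```

With these sample values the cosine comes out at roughly 0.43. If you still see results above 1 before the clamp, check that both norms are computed over exactly the same vectors that go into the dot product.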