
Javascript Clusterfck Metric


So I am converting an old data visualization to a new platform, and I am a little stuck on its community sorting feature. In the original code, the author appears to use agglomerative clustering with a cosine similarity calculator. I figured the best way to approach this in JavaScript would be to build a tree with clusterfck, using my custom cosine similarity function as the metric. The tree sorts ALMOST correctly for each set of data I pass. (But due to project specifications, "almost" isn't good enough.) I checked my algorithm and everything looks right, but when I compare my results using cosine similarity and Euclidean distance, I get the same sorting result.

What could be causing this? I think I may be passing something incorrectly and clusterfck is falling back to Euclidean as a default. Below is a chunk of my code; can someone verify? (Also, is there an easier way to calculate cosine similarity? I don't think JS has a built-in dot product.)

clusters = clusterfck.hcluster(relationArray, clusterfck.cosSim2, clusterfck.SINGLE_LINKAGE);
postOrder(clusters);

function postOrder(t) {
    i++;
    if (t == null) {
        return;
    } else {
        postOrder(t.left);
        postOrder(t.right);
        if (t.left == null && t.right == null) {
            communityArr.push(t.canonical[0]);
        } else {
            return;
        }
    }
}

function cosSim2(arr1, arr2) {
    var d1 = 0,
        d2 = 0,
        cos = 0;

    for (var i = 0; i < arr1.length; i++) {
        d1 += Math.pow(arr1[i], 2);
    }

    for (var j = 0; j < arr2.length; j++) {
        d2 += Math.pow(arr2[j], 2);
    }

    d1 = Math.sqrt(d1);
    d2 = Math.sqrt(d2);

    for (var j = 0; j < arr2.length; j++) {
        if (arr1[j] == null) {
            cos += 0;
        } else {
            cos += arr1[j] * arr2[j];
        }
    }

    var cosSimilarity = cos / (d1 * d2);
    return cosSimilarity;
}

Solution

  • I suppose this response is too late for you. But in case somebody else stumbles across this:

    The problem is that you call clusterfck.hcluster with the parameter clusterfck.cosSim2 as your distance measure. But since your distance function is simply cosSim2 (not a property of the clusterfck object), that expression evaluates to undefined, so you effectively call clusterfck.hcluster with an undefined distance function, and clusterfck falls back to its default, which is "euclidean"...
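    You can see the mechanism in a few lines of plain JavaScript: reading a property that an object does not export simply yields undefined, with no error. The stubbed clusterfck object below is illustrative, not the real library's export list.

```javascript
// Illustrative stub: the real clusterfck exports hcluster, linkage
// constants, and its built-in metrics, but NOT a user-defined
// function such as cosSim2.
var clusterfck = { hcluster: function () {}, SINGLE_LINKAGE: "single" };

// Accessing a property that was never exported is not an error in JS;
// it just produces undefined, which hcluster treats as "use the default".
console.log(typeof clusterfck.cosSim2); // "undefined"

// The fix: pass your own function directly, not as a clusterfck property:
// clusters = clusterfck.hcluster(relationArray, cosDist, clusterfck.SINGLE_LINKAGE);
```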

    Also please note that your function indeed measures the similarity between vectors, not their distance. That is: The greater the cosine similarity, the more similar the vectors (i.e., the smaller the angle between them).

    But clusterfck.hcluster expects a genuine distance measure. That is, the opposite is supposed to be true: The greater the value of the distance measure, the more distant the vectors (i.e., the less similar the vectors).

    Calling clusterfck.hcluster with your similarity function directly would have the effect that the least similar items are clustered together.

    You can easily derive a distance function from your cosine similarity function as follows:

    function cosDist(arr1, arr2) {
        return 1 - cosSim2(arr1, arr2);
    }
    

    This new function cosDist has values ranging from 0 to 2: identical vectors have distance 0 (as expected), and the most distant (i.e., dissimilar) vectors have distance 2.
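    As for your side question about a simpler cosine similarity: JavaScript has no built-in dot product, but Array.prototype.reduce covers it in one line. A compact sketch (the names dot, norm, cosSim, and cosDist here are illustrative, not part of clusterfck):

```javascript
// Dot product via reduce; missing entries in b are treated as 0,
// mirroring the null check in the original cosSim2.
function dot(a, b) {
    return a.reduce(function (sum, x, i) { return sum + x * (b[i] || 0); }, 0);
}

// Euclidean norm of a vector is the square root of its dot product with itself.
function norm(a) {
    return Math.sqrt(dot(a, a));
}

function cosSim(a, b) {
    return dot(a, b) / (norm(a) * norm(b));
}

// Distance form expected by clusterfck.hcluster.
function cosDist(a, b) {
    return 1 - cosSim(a, b);
}

console.log(cosDist([1, 2, 3], [1, 2, 3])); // ~0 (identical vectors)
console.log(cosDist([1, 0], [-1, 0]));      // 2  (opposite vectors)
```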

    And another note: The Wikipedia article http://en.wikipedia.org/wiki/Cosine_similarity points out that this cosDist is not a proper distance metric in the mathematical sense (the triangle inequality does not generally hold), but in my experience this is not a problem in practice when using it for hierarchical clustering, and it is often used this way. Nevertheless, there is a way to derive a genuine distance metric from cosine similarity, described in the same article: https://en.wikipedia.org/wiki/Cosine_similarity#Angular_distance_and_similarity
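    If you do want a true metric, the angular distance from that article can be sketched as follows. This is my own illustration, not clusterfck code; it inlines a cosine similarity like the one in the question and clamps the result to [-1, 1] so floating-point drift cannot push Math.acos out of its domain.

```javascript
function cosSim(a, b) {
    var dot = 0, na = 0, nb = 0;
    for (var i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        na += a[i] * a[i];
        nb += b[i] * b[i];
    }
    return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Angular distance: arccos(similarity) / PI, ranging from 0 to 1.
// Unlike 1 - cosSim, this satisfies the triangle inequality.
function angularDist(a, b) {
    var sim = Math.max(-1, Math.min(1, cosSim(a, b)));
    return Math.acos(sim) / Math.PI;
}

console.log(angularDist([1, 0], [1, 0]));  // 0   (same direction)
console.log(angularDist([1, 0], [0, 1]));  // 0.5 (90 degrees)
console.log(angularDist([1, 0], [-1, 0])); // 1   (180 degrees)
```

    Either cosDist or angularDist can be passed as the second argument to clusterfck.hcluster; angular distance just gives you the metric guarantees as well.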