Search code examples
apache-sparkmachine-learningcluster-analysisapache-spark-mllibcosine-similarity

Cluster Scenario: Difference between the computedCost of 2 points used as similarity measure between points. Is it applicable?


I want to have a measure of similarity between two points in a cluster. Would the similarity calculated this way be an acceptable measure of similarity between the two datapoint?

Say I have to vectors: vector A and vector B that are in the same cluster. I have trained a cluster which is denoted by model and then model.computeCost() computes thesquared distance between the input point and the corresponding cluster center.

(I am using Apache Spark MLlib)

val costA = model.computeCost(A)
val costB = model.computeCost(B)

val dissimilarity = |cost(A)-cost(B)|

Dissimilarity i.e. the higher the value, the more unlike each other they are.


Solution

  • If you are just asking is this a valid metric then the answer is almost, it is a valid pseudometric if only .computeCost is deterministic.

    For simplicity i denote f(A) := model.computeCost(A) and d(A, B) := |f(A)-f(B)|

    Short proof: d is a L1 applied to an image of some function, thus is a pseudometric itself, and a metric if f is injective (in general, yours is not).

    Long(er) proof:

    • d(A,B) >= 0 yes, since |f(A) - f(B)| >= 0
    • d(A,B) = d(B,A) yes, since |f(A) - f(B)| = |f(B) - f(A)|
    • d(A,B) = 0 iff A=B, no, this is why it is pseudometric, since you can have many A != B such that f(A) = f(B)
    • d(A,B) + d(B,C) <= d(A,C), yes, directly from the same inequality for absolute values.

    If you are asking will it work for your problem, then the answer is it might, depends on the problem. There is no way to answer this without analysis of your problem and data. As shown above this is a valid pseudometric, thus it will measure something decently behaving from mathematical perspective. Will it work for your particular case is completely different story. The good thing is most of the algorithms which work for metrics will work with pseudometrics as well. The only difference is that you simply "glue together" points which have the same image (f(A)=f(B)), if this is not the issue for your problem - then you can apply this kind of pseudometric in any metric-based reasoning without any problems. In practise, that means that if your f is

    computes the sum of squared distances between the input point and the corresponding cluster center

    this means that this is actually a distance to closest center (there is no summation involved when you consider a single point). This would mean, that 2 points in two separate clusters are considered identical when they are equally far away from their own clusters centers. Consequently your measure captures "how different are relations of points and their respective clusters". This is a well defined, indirect dissimilarity computation, however you have to be fully aware what is happening before applying it (since it will have specific consequences).