I have vectors like this:
v1 = {0 0 0 1 1 0 0 1 0 1 1}
v2 = {0 1 1 1 1 1 0 1 0 1 0}
v3 = {0 0 0 0 0 0 0 0 0 0 1}
I need to calculate the similarity between them. The Hamming distance between v1 and v2 is 4, and between v1 and v3 it is also 4. But because I am interested in the groups of '1' that sit together, for me v2 is far more similar to v1 than v3 is.
Is there any distance metric that can capture this in the data?
The data represent the occupancy of a house over time, which is why this matters to me. '1' means occupied, '0' means not occupied.
It sounds like you need the cosine similarity measure:
similarity = cos(v1, v2) = v1 * v2 / (|v1| |v2|)
where v1 * v2 is the dot product of v1 and v2:
v1 * v2 = v1[1]*v2[1] + v1[2]*v2[2] + ... + v1[n]*v2[n]
Essentially, the dot product counts how many positions hold a 1 in both vectors: if v1[k] == 1 and v2[k] == 1, the sum (and thus the similarity) increases by one; otherwise it is unchanged.
You can use the dot product itself, but often you want the final similarity to be normalized, e.g. to lie between 0 and 1. In that case, divide the dot product of v1 and v2 by the product of their lengths, |v1| and |v2|. The length of a vector is the square root of its dot product with itself:
|v| = sqrt(v[1]*v[1] + v[2]*v[2] + ... + v[n]*v[n])
With these pieces, cosine similarity is easy to implement (example in Python):
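from math import sqrt

def dot(v1, v2):
    # Sum of element-wise products; for 0/1 vectors, the count of shared 1s.
    return sum(x * y for x, y in zip(v1, v2))

def length(v):
    # Euclidean length: square root of the vector's dot product with itself.
    return sqrt(dot(v, v))

def sim(v1, v2):
    # Cosine similarity; raises ZeroDivisionError if either vector is all zeros.
    return dot(v1, v2) / (length(v1) * length(v2))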
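To see how this behaves on the vectors from the question, here is a small self-contained check. It repeats the three functions so it runs on its own; the variable names `s12` and `s13` are just illustrative.

```python
from math import sqrt

def dot(v1, v2):
    # For 0/1 vectors this counts positions where both vectors hold a 1.
    return sum(x * y for x, y in zip(v1, v2))

def length(v):
    return sqrt(dot(v, v))

def sim(v1, v2):
    return dot(v1, v2) / (length(v1) * length(v2))

# Vectors copied from the question.
v1 = [0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1]
v2 = [0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 0]
v3 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1]

s12 = sim(v1, v2)  # 4 / sqrt(5 * 7), about 0.676
s13 = sim(v1, v3)  # 1 / sqrt(5 * 1), about 0.447
print(round(s12, 3), round(s13, 3))
```

So although both pairs have Hamming distance 4, v2 scores higher against v1 than v3 does, because v1 and v2 share four occupied slots while v1 and v3 share only one.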
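Note that I described similarity (how close two vectors are to each other), not distance (how far apart they are). If you need a distance specifically, you can calculate it as dist = 1 - sim, which is 0 for vectors pointing the same way and 1 for vectors that share no 1s. (The reciprocal 1 / sim also reverses the ordering, but it is undefined when sim = 0 and never goes below 1.)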
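If a distance is what you need, one possible sketch is below. It assumes the common 1 - sim convention rather than a reciprocal, since that stays defined when two vectors share no 1s. The function name `cosine_distance` and the handling of all-zero vectors (a house that is never occupied) are my own assumptions, not something the data dictates.

```python
from math import sqrt

def cosine_distance(v1, v2):
    # Dot product and product of Euclidean lengths, as in the similarity above.
    d = sum(x * y for x, y in zip(v1, v2))
    norm = sqrt(sum(x * x for x in v1)) * sqrt(sum(y * y for y in v2))
    if norm == 0:
        # Assumption: an all-zero vector is treated as maximally distant,
        # because cosine similarity is undefined for it.
        return 1.0
    return 1.0 - d / norm

print(cosine_distance([1, 0, 1], [0, 1, 0]))  # no shared 1s
print(cosine_distance([0, 0, 0], [1, 1, 0]))  # all-zero vector, fallback branch
```

Depending on what "never occupied" should mean for your data, you may prefer to raise an error or return 0 when both vectors are empty.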