Tags: algorithm, hash, knn, locality-sensitive-hash, approximate-nn-searching

Why are there both k and l parameters in LSH for approximate nearest neighbours?


In all the Locality-Sensitive Hashing explanations (e.g. http://en.wikipedia.org/wiki/Locality-sensitive_hashing#LSH_algorithm_for_nearest_neighbor_search), they describe that k hash functions are generated, but only l (l < k) are used in the hash tables to hash the values.

Why generate k hash functions at all and not just l?

Why the separate factors k and l?

I don't understand it.


Solution

  • All of the hash functions are in fact used. This makes more sense if you remember that, for example, in the section "Bit sampling for Hamming distance" an individual hash function might simply return a single bit. Another example of an LSH hash function is to choose a random plane in d-dimensional space and to return 0 or 1 according to which side of the plane the point being hashed lies on.

    To address a single table, because each hash function may return just a single bit, you evaluate k hash functions and concatenate the results, giving you perhaps a k-bit key. With l tables you need l different keys, so in fact you need a total of l*k hash functions (see the sketch below).

    Check: look at the success probability. When looking up a single table a single hash function returns the same value for the query and the near neighbour with probability P1. To find the near neighbour in a single table you must get all the hash functions to work, so that probability is P1^k and that single lookup fails with probability 1 - P1^k. But you try this l times so the probability that all lookups fail is (1-P1^k)^l and the success probability is 1-(1-P1^k)^l, which is exactly what they calculate.