
LSH Binning On-The-Fly


I want to use MinHash LSH to bin a large number of documents into buckets of similar documents (Jaccard similarity).

The question: Is it possible to compute the bucket of a MinHash without knowing about the MinHash of the other documents?

As far as I understand, LSH "just" computes a hash of the MinHashes. So it should be possible?

One implementation I find quite promising is datasketch. I can query the LSH for documents similar to a given one once I know the MinHashes of all documents. However, I see no way to get the bucket of a single document before knowing the other ones. https://ekzhu.github.io/datasketch/index.html
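
For reference, this is roughly how I use datasketch today (a minimal sketch; the threshold, num_perm and example documents are just placeholders):

```python
from datasketch import MinHash, MinHashLSH

def make_minhash(text, num_perm=128):
    """Build a MinHash from whitespace-separated tokens."""
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf8"))
    return m

docs = {"doc1": "the quick brown fox", "doc2": "the quick brown dog"}

# Index every document first ...
lsh = MinHashLSH(threshold=0.5, num_perm=128)
for key, text in docs.items():
    lsh.insert(key, make_minhash(text))

# ... and only then can I ask for the neighbours of one of them.
print(lsh.query(make_minhash(docs["doc1"])))
```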


Solution

  • LSH doesn't bucket entire documents, nor does it bucket individual minhashes. Rather, it buckets 'bands' of minhashes.

    LSH is a means of both reducing the number of hashes stored per document and reducing the number of hits found when using these hashes to search for similar documents. It achieves this by combining multiple minhashes into a single hash. So, for example, instead of storing 200 minhashes per document, you might combine them in bands of four to yield 50 locality-sensitive hashes.

    The hash for each band is calculated from its constituent minhashes using a cheap hash function such as FNV-1a. This loses some information, which is why LSH is said to reduce the dimensionality of the data. The resulting hash is the bucket.

    So the bucket for each band of minhashes within a document is calculated without requiring knowledge of any other bands or any other documents (a small sketch of this per-band bucketing is given after this answer).

    Using LSH hashes to find similar documents is simple: Let's say you want to find documents similar to document A. First generate the (e.g.) 50 LSH hashes for document A. Then look in your hash dictionary for all other documents that share one or more of these hashes. The more hashes they share, the higher their estimated Jaccard similarity (though this is not a linear relationship, as it is when using plain minhashes). A lookup sketch along these lines follows the banding sketch below.

    The fewer total hashes stored per document, the greater the error in estimated Jaccard similarity, and the greater the chance of missing similar documents.

    Here's a good explanation of LSH.
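
To make the banding and per-band hashing concrete, here is a minimal sketch in plain Python (not datasketch's internals; the band size, the 64-bit FNV-1a constants and the bucket layout are illustrative assumptions):

```python
def fnv1a(data: bytes) -> int:
    """64-bit FNV-1a hash of a byte string."""
    h = 0xCBF29CE484222325
    for byte in data:
        h ^= byte
        h = (h * 0x100000001B3) & 0xFFFFFFFFFFFFFFFF
    return h

def lsh_buckets(minhashes, band_size=4):
    """Group a document's minhashes into bands and return one bucket per band.

    With 200 minhashes and band_size=4 this yields 50 locality-sensitive
    hashes. Only this document's minhashes are needed; no other document
    (or band) is consulted.
    """
    buckets = []
    for i in range(0, len(minhashes), band_size):
        band = minhashes[i:i + band_size]
        data = b"".join(h.to_bytes(8, "little") for h in band)
        # Include the band index so equal bands at different positions
        # land in different buckets.
        buckets.append((i // band_size, fnv1a(data)))
    return buckets
```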
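
And a matching sketch of the dictionary lookup described above, using the hypothetical lsh_buckets helper from the previous snippet:

```python
from collections import defaultdict

index = defaultdict(set)  # bucket -> ids of documents that produced it

def add_document(doc_id, minhashes):
    """Index a document by its per-band buckets."""
    for bucket in lsh_buckets(minhashes):
        index[bucket].add(doc_id)

def similar_documents(minhashes):
    """Rank indexed documents by how many LSH hashes they share with the query."""
    counts = defaultdict(int)
    for bucket in lsh_buckets(minhashes):
        for doc_id in index[bucket]:
            counts[doc_id] += 1
    # More shared hashes -> higher estimated Jaccard similarity,
    # though the relationship is not linear.
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
```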