Search code examples
rmd5minhashropenscilsh

Why does textreuse packge in R make LSH buckets way larger than the original minhashes?


As far as I understand one of the main functions of the LSH method is data reduction even beyond the underlying hashes (often minhashes). I have been using the textreuse package in R, and I am surprised by the size of the data it generates. textreuse is a peer-reviewed ROpenSci package, so I assume it does its job correctly, but my question persists.

Let's say I use 256 permutations and 64 bands for my minhash and LSH functions respectively -- realistic values that are often used to detect with relative certainty (~98%) similarities as low as 50%.

If I hash a random text file using TextReuseTextDocument (256 perms) and assign it to trtd, I will have:

object.size(trtd$minhashes)
> 1072 bytes

Now let's create the LSH buckets for this object (64 bands) and assign it to l, I will have:

object.size(l$buckets)
> 6704 bytes

So, the retained hashes in the LSH buckets are six times larger than the original minhashes. I understand this happens because textreuse uses a md5 digest to create the bucket hashes.

But isn't this too wasteful / overkill, and can't I improve it? Is it normal that our data reduction technique ends up bloating up to this extent? And isn't it more efficacious to match the documents based on the original hashes (similar to perms = 256 and bands = 256) and then use a threshold to weed out the false positives?

Note that I have reviewed the typical texts such as Mining of Massive Datasets, but this question remains about this particular implementation. Also note that the question is not only out of curiosity, but rather out of need. When you have millions or billions of hashes, these differences become significant.


Solution

  • Package author here. Yes, it would be wasteful to use more hashes/bands than you need. (Though keep in mind we are talking about kilobytes here, which could be much smaller than the original documents.)

    The question is, what do you need? If you need to find only matches that are close to identical (i.e., with a Jaccard score close to 1.0), then you don't need a particularly sensitive search. If, however, you need to reliable detect potential matches that only share a partial overlap (i.e., with a Jaccard score that is closer to 0), then you need more hashes/bands.

    Since you've read MMD, you can look up the equation there. But there are two functions in the package, documented here, which can help you calculate how many hashes/bands you need. lsh_threshold() will calculate the threshold Jaccard score that will be detected; while lsh_probability() will tell you how likely it is that a pair of documents with a given Jaccard score will be detected. Play around with those two functions until you get the number of hashes/bands that is optimal for your search problem.