Search code examples
algorithmhashedit-distancelocality-sensitive-hash

Locality-sensitive hashing of strings?


Is there a hash function for strings, such that strings within a small edit distance (for example, misspellings) would map to the same, or very close, hash values, while dissimilar strings would tend not to?


Solution

  • One option is to calculate set of all k-mers (substrings of length k), hash them and calculate the minimum. So you are combining idea of shingles, with idea of minhashing. (repeat multiple times to get better results, as usual with LSH schemes)

    The way how this works is that probability of two string having same minhash is same as Jackard similarity of their k-mer sets. Similarity of k-mer sets is related to edit distance (but not the same).