c++ unordered-map trie levenshtein-distance

Implement "did you mean" over the keys of an unordered_map


My situation is: I have an unordered map of names to stuff.

Clients may input names (say, fooo), which are looked up with find(); keys that are not found print "not found".

I would like to offer the client a better output: "fooo not found. did you mean foo ?"

I reckon this isn't possible without implementing a trie that mirrors the key collection, so that a "find the smallest Levenshtein distance" algorithm can be applied to it. Do I reckon badly or do I reckon correctly?


Solution

  • It is almost certainly not worth getting fancy here. Implement the brute-force solution that iterates through all the keys, computes the edit distance from the input to each one, and takes the minimum. Profile it, and you'll probably find it's fast enough.

    But if you want to have fun...

    String edit distance satisfies the triangle inequality, which means any geometric approximate-nearest-neighbor (ANN) data structure that can take arbitrary distance functions applies here. I'm fond of LSH.

    But ANN gets worse as the dimension increases, and the dimension is roughly the string length. So you might want a less rigorous approach. BLAST (genome search) does substring-based exact lookup. Your strings are shorter, so you might need bigrams or trigrams. Alternatively, you might assume the input's length will be close to the correct key's length, and only check the keys whose length is a near-match.

    If you have access to a large database of typos, you could try training a convolutional neural net (one-hot encode each character) to map strings to low-dimensional feature vectors, with a cost function that puts typos close to their intended strings. Then keep the feature vectors of the legit strings in a k-d tree.

    But all that's for fun. If the code matters, keep it simple.
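
For reference, here is a minimal sketch of the brute-force approach, assuming C++17 (for std::optional). The names levenshtein and did_you_mean, and the max_distance cutoff with its default of 2, are made up for this illustration and are not part of any library. The loop also skips keys whose length differs from the input's by too much to win, since the length difference is a lower bound on the edit distance; that is the "length will be close" shortcut in its simplest form.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Levenshtein distance via the usual two-row dynamic program.
// (Hypothetical helper name; O(|a| * |b|) time, O(|b|) memory.)
std::size_t levenshtein(const std::string& a, const std::string& b) {
    std::vector<std::size_t> prev(b.size() + 1), cur(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) prev[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        cur[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            const std::size_t subst =
                prev[j - 1] + (a[i - 1] == b[j - 1] ? 0 : 1);
            cur[j] = std::min({prev[j] + 1, cur[j - 1] + 1, subst});
        }
        std::swap(prev, cur);
    }
    return prev[b.size()];
}

// Scan every key and return the closest one, or nothing if no key is
// within max_distance edits. max_distance is an assumed tuning knob.
template <typename Value>
std::optional<std::string> did_you_mean(
        const std::unordered_map<std::string, Value>& map,
        const std::string& query,
        std::size_t max_distance = 2) {
    std::optional<std::string> best;
    std::size_t best_distance = max_distance + 1;
    for (const auto& entry : map) {
        const std::string& key = entry.first;
        // The length gap is a lower bound on the edit distance, so a key
        // whose gap already reaches the best distance cannot improve on it.
        const std::size_t len_gap = key.size() > query.size()
                                        ? key.size() - query.size()
                                        : query.size() - key.size();
        if (len_gap >= best_distance) continue;
        const std::size_t d = levenshtein(query, key);
        if (d < best_distance) {
            best_distance = d;
            best = key;
        }
    }
    return best;
}
```

When find() misses, call did_you_mean(map, input) and print the suggestion only if the returned optional holds a value. If several keys tie for the smallest distance, which one is returned depends on the map's iteration order.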