Tags: algorithm, nlp, spell-checking

Constant-time Spelling Correction on Ten Million Entities


I have a list of ~10M entities. I need to match an entity that a user types against an entity from the list. Users often misspell the entities (e.g., orang instead of orange). I need to correct 1-2 instances of letter replacement (aca instead of aba), letter insertion (aca instead of ac), and letter deletion (aca instead of acca). I want to do this in constant time with respect to the size of the entity list.

Precomputing a dictionary of all spellings within 1-2 edits of each entity would give constant-time lookup but requires an intractably large amount of memory. Running edit distance against every entity is linear in the size of the entity list. I'm thinking there is probably a clever algorithm to prune the candidate matches down to <100 (maybe via a clever hash of the letters in the entity), after which I could run edit distance on the small set of candidates.
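For concreteness, a minimal sketch of the verification half of that two-stage plan: a standard dynamic-programming Levenshtein distance, run only over whatever small candidate set the (unspecified) pruning step returns. The `prune_candidates` name is a hypothetical placeholder, not an existing function.

```python
def levenshtein(a: str, b: str) -> int:
    """Standard DP edit distance covering substitution, insertion, and deletion."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def correct(query: str, candidates: set, max_dist: int = 2) -> list:
    """Rank a pruned candidate set by edit distance; keep matches within max_dist."""
    scored = sorted((levenshtein(query, c), c) for c in candidates)
    return [c for d, c in scored if d <= max_dist]

# candidates = prune_candidates(query)   # hypothetical pruning step, <100 survivors
# matches = correct(query, candidates)
```

With fewer than 100 survivors, this verification pass is effectively constant time regardless of the 10M-entity list.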

Does anyone know of a technique that will work here?


Solution

  • In addition to the linked document in Matt's comment (suggesting to generate/compare/search via deletions only; see the sketch after this answer), you can try using a DAWG (aka MADFA, aka DAFSA) to store all possible distance=2 words. For example, for Python there's pyDAWG. Not sure if the space savings will be sufficient for your needs, as that depends on the language, but if your affixes are similar, the savings could be quite significant: each substitution/deletion is just an extra arc, and each insertion is only one more node.
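Not from pyDAWG's documentation, but here is a minimal sketch of the deletion-only indexing that the linked comment describes, using a plain Python dict; the names (deletion_variants, build_index, lookup) are illustrative. A real implementation would store the variant strings in a DAWG/DAFSA such as pyDAWG rather than a dict, which is where the space savings above come in.

```python
from collections import defaultdict
from itertools import combinations

def deletion_variants(word: str, max_del: int = 2) -> set:
    """The word plus every string reachable by deleting up to max_del letters."""
    out = {word}
    for k in range(1, min(max_del, len(word)) + 1):
        for drop in combinations(range(len(word)), k):
            out.add("".join(c for i, c in enumerate(word) if i not in drop))
    return out

def build_index(entities, max_del=2):
    """Offline step: index every entity under each of its deletion variants."""
    index = defaultdict(set)
    for ent in entities:
        for var in deletion_variants(ent, max_del):
            index[var].add(ent)
    return index

def lookup(query, index, max_del=2):
    """Candidates within ~max_del edits: their variant sets overlap the query's."""
    hits = set()
    for var in deletion_variants(query, max_del):
        hits |= index.get(var, set())
    return hits
```

Query cost depends only on the query's length, not on the 10M entities; the resulting hits can then be verified and ranked with a real edit-distance pass, as in the question's plan.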