I'm building an autocomplete that has to query 10+ million words/phrases quickly, and I'm running into some problems. My first idea was some sort of trie/ternary search tree, but those are strictly prefix matching, which isn't good enough for my application (I want full infix matching). I then moved on to the bigger solutions: SQL Server full-text indexing, Lucene, Solr, Sphinx. But Lucene and SQL Server full-text indexing aren't really full infix search either; they're prefix matching with nifty extras (soundex, proximity, etc.). I tried to think of a way Levenshtein edit distance could help, but I couldn't find one that is both reasonably accurate and tolerant of words with high edit distances (e.g. "google" and "ogl" have an edit distance of 3, but 3 is way too high a threshold for the general case).
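To make the threshold concern concrete, here is a minimal sketch of the classic dynamic-programming (Wagner-Fischer) edit distance; the word pair and the function name are just illustrations, not part of any engine mentioned above. It confirms that "google" vs. "ogl" really is distance 3, so any threshold loose enough to catch that pair would also admit a flood of unrelated words.

```python
def levenshtein(a, b):
    """Classic DP edit distance, one row of the table at a time."""
    prev = list(range(len(b) + 1))        # distance from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        cur = [i]                         # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # delete ca
                           cur[j - 1] + 1,               # insert cb
                           prev[j - 1] + (ca != cb)))    # substitute (free if equal)
        prev = cur
    return prev[-1]

print(levenshtein("google", "ogl"))  # -> 3
```

Since the strings differ in length by 3, the distance cannot be lower; three deletions achieve it exactly.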
My question is: how do powerhouses like Google/Bing do it? Do they just brute-force it past a certain point? I would imagine not, but I can't find any evidence either way.
Any help would be appreciated!
If you enable

queryParser.setAllowLeadingWildcard(true);

in Lucene, you can use leading and trailing wildcards like:
*talli*
That would pick up all single-word terms that contain "talli" including "Metallica".
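To spell out what those semantics are, here is a tiny, hypothetical sketch (plain substring scan over an in-memory term list, not Lucene itself) of what a `*talli*` query matches; the term list is made up for illustration:

```python
def infix_matches(infix, terms):
    """Return every term containing `infix` anywhere, like a *infix* wildcard query."""
    return [t for t in terms if infix in t]

terms = ["metallica", "tallinn", "crystal", "tally"]
print(infix_matches("talli", terms))  # -> ['metallica', 'tallinn']
```

Note "tally" is excluded: it contains "tall" but not the full infix "talli". This linear scan is exactly the brute force you want to avoid at 10+ million terms, which is why the leading wildcard needs index support.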
This may not be fast enough for you, but in some cases (prefix-only wildcard searches to be precise) if you can preprocess the query string you might be able to get by with the old "reverse the term and index that also" trick:
acillateM
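The idea behind the trick can be sketched like this (a minimal in-memory stand-in for the reversed field, not Lucene; the term list and function names are made up): a leading-wildcard query such as *talli becomes a cheap prefix query illat* against a sorted index of reversed terms.

```python
import bisect

terms = ["metallica", "crystal", "tally"]
# Index a reversed copy of every term, sorted so prefix lookups can binary-search.
reversed_index = sorted(t[::-1] for t in terms)

def suffix_search(suffix):
    """Find terms ending in `suffix` via a prefix scan of the reversed index."""
    prefix = suffix[::-1]                      # *ica  ->  aci* on the reversed field
    lo = bisect.bisect_left(reversed_index, prefix)
    hits = []
    for rev in reversed_index[lo:]:
        if not rev.startswith(prefix):
            break                              # sorted order: no more matches possible
        hits.append(rev[::-1])                 # un-reverse back to the real term
    return hits

print(suffix_search("ica"))  # -> ['metallica']
```

Note this only converts one leading wildcard into a prefix search; a double-ended query like *talli* still needs infix support (e.g. indexing all rotations or n-grams of each term), which is a much bigger index.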