Can someone provide a little more detail than is in the Solr manual for configuring fields to be used for spellcheck?
The field I use for spellcheck is analyzed with WhitespaceTokenizerFactory, WordDelimiterFilterFactory (to omit punctuation such as commas and periods after word tokens), StopFilterFactory, and RemoveDuplicatesTokenFilterFactory. Is this reasonable?
I enabled DEBUG logging on DirectSolrSpellChecker and on SpellCheckComponent, but the only additional logging output that I see is the request to perform spell checking. Looking at the code, I don't see any additional DEBUG outputs, and looking into the underlying Lucene component, I don't see any DEBUG outputs at all. Is there another logger to enable?
--EDIT--
I discovered that it pays to try different misspellings with the same Levenshtein distance. Strangely, some misspellings are corrected and some are not. Case in point:
The corpus has 3069 instances of "hydraulic," 17 instances of "hydrauhc," 14 of "hydraullc", 3 of "hydrauli", and 3 of "hydraulrc." (There's a lot of OCR in the corpus.)
Solr will suggest these words given a query for "hydrulic" or "hydruulic," but will suggest nothing given "hydralic" or "hydraalic." Yet the Levenshtein distance between each of these four words and "hydraulic" is 1.
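To double-check the edit-distance claim, here is a small sketch in plain Python. It is a textbook dynamic-programming Levenshtein implementation, not the one Lucene uses internally, and the function name levenshtein is just illustrative.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


target = "hydraulic"
queries = ["hydrulic", "hydruulic", "hydralic", "hydraalic"]
distances = {q: levenshtein(q, target) for q in queries}
for q, d in distances.items():
    print(f"{q} -> {target}: {d}")
```

All four queries come out at distance 1, so edit distance alone does not explain why only two of them get suggestions.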
Figured it out.
If the misspelled word is in the corpus, but documents that contain it are filtered out by an fq parameter, then the spelling corrector will return no suggestions, but will tell you the word is misspelled if spellcheck.extendedResults is true.
This paragraph in the Solr spellchecking documentation is critical:
spellcheck.alternativeTermCount Defines the number of suggestions to return for each query term existing in the index and/or dictionary. Presumably, users will want fewer suggestions for words with docFrequency>0. Also, setting this value enables context-sensitive spell suggestions.
Rewording the above:
If the search term exists in the index, but not in the result set, no corrections will be offered unless spellcheck.alternativeTermCount > 0.
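For illustration, a request combining these parameters might be built roughly like this. This is only a sketch: the host, the core name "mycore", the /select handler (which would need the spellcheck component attached), and the filter "collection:manuals" are all assumptions, and build_spellcheck_url is a hypothetical helper. Only the parameter names (q, fq, spellcheck, spellcheck.extendedResults, spellcheck.alternativeTermCount) are real Solr parameters.

```python
from urllib.parse import urlencode


def build_spellcheck_url(base_url, term, filter_query, alternative_term_count=5):
    """Build a Solr query URL that asks for suggestions even for indexed terms."""
    params = [
        ("q", term),
        ("fq", filter_query),  # the filter that hides documents containing the misspelling
        ("spellcheck", "true"),
        ("spellcheck.extendedResults", "true"),
        # A value > 0 lets Solr suggest alternatives for terms that exist in the index.
        ("spellcheck.alternativeTermCount", str(alternative_term_count)),
    ]
    return f"{base_url}?{urlencode(params)}"


# Hypothetical host, core, and filter, used only for the example.
url = build_spellcheck_url(
    "http://localhost:8983/solr/mycore/select",
    "hydralic",
    "collection:manuals",
)
print(url)
```

The same parameters could instead be set as defaults on the request handler in solrconfig.xml, so that clients don't have to send them on every request.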