Guidance beyond Solr manual needed for spellcheck


Can someone provide a little more detail than is in the Solr manual for configuring fields to be used for spellcheck?

  1. I'm using a DirectSolrSpellChecker. I assume that, as for IndexBasedSpellChecker, I should avoid fields that are "heavily processed." The analysis chain for the field I'm using is WhitespaceTokenizerFactory, WordDelimiterFilterFactory (to strip punctuation such as commas and periods that trail word tokens), StopFilterFactory and RemoveDuplicatesTokenFilterFactory. Is this reasonable?
  2. The manual never explicitly states whether the field used for spelling needs to be stored. I have run some unit tests with the embedded Solr server, and it appears that the field need only be indexed. It also appears that the field can be single- or multi-valued. Are these assumptions correct?
  3. Are there any diagnostics for analyzing why a query containing a misspelled word, one that is an edit distance of 1 away from a correctly spelled word, produces no suggestion? Specifically, the correctly spelled word is in the field used for spellchecking (I can query it), and a request to the search handler with spellcheck enabled returns the spellcheck suggestions section, but that section is empty. (In a toy example with the embedded server and a couple of documents loaded, I can produce suggestions, but in an actual core with thousands of documents, the same test produces empty results.)
  4. I have enabled ALL logging on DirectSolrSpellChecker and on SpellCheckComponent, but the only additional logging output that I see is the request to perform spell checking. Looking at the code, I don't see any additional DEBUG outputs, and looking into the underlying Lucene component, I don't see any DEBUG outputs at all. Is there another logger to enable?

--EDIT--

I discovered that it pays to try different misspellings with the same Levenshtein distance. Strangely, some misspellings are corrected and some are not. Case in point:

The corpus has 3069 instances of "hydraulic," 17 instances of "hydrauhc," 14 of "hydraullc", 3 of "hydrauli", and 3 of "hydraulrc." (There's a lot of OCR in the corpus.)

Solr will suggest these words given a query for "hydrulic" or "hydruulic," but will suggest nothing given "hydralic" or "hydraalic." Yet each of these four queries is at a Levenshtein distance of 1 from "hydraulic."
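To double-check the distance claim, here is a minimal sketch of a plain Levenshtein computation. The `levenshtein` function is my own illustrative helper, not anything from Solr or Lucene, and the distance measure that a given spellchecker is actually configured with may differ in detail.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca from a
                curr[j - 1] + 1,     # insert cb into a
                prev[j - 1] + cost,  # substitute ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


target = "hydraulic"
queries = ["hydrulic", "hydruulic", "hydralic", "hydraalic"]
distances = {q: levenshtein(q, target) for q in queries}

for q, d in distances.items():
    print(f"{q!r} -> {target!r}: {d}")
```

All four queries come out at distance 1, so the difference in behavior cannot be explained by edit distance alone.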


Solution

  • Figured it out.

    If the misspelled word is in the corpus, but documents that contain it are filtered out by an fq parameter, then the spelling corrector will return no suggestions, but will tell you the word is misspelled if spellcheck.extendedResults is true.

    This paragraph in the Solr spellchecking documentation is critical:

    spellcheck.alternativeTermCount: Defines the number of suggestions to return for each query term existing in the index and/or dictionary. Presumably, users will want fewer suggestions for words with docFrequency>0. Also, setting this value enables context-sensitive spell suggestions.

    Rewording the above:

    If the search term exists in the index, but not in the result set, no corrections will be offered unless spellcheck.alternativeTermCount > 0.
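For illustration, here is a rough sketch of how such a request's query string could be assembled. The parameter names (`fq`, `spellcheck`, `spellcheck.extendedResults`, `spellcheck.alternativeTermCount`) are the Solr parameters discussed above. The helper `build_spellcheck_query`, the filter `collection:manuals`, the host, the core name `mycore`, and the `/select` handler path are all assumptions made up for the example, and the handler is assumed to have the spellcheck component attached.

```python
from urllib.parse import urlencode


def build_spellcheck_query(term, filter_query, alternative_term_count=5):
    """Build a query string that asks for suggestions even when the term is indexed.

    Hypothetical helper for illustration; it only assembles parameters
    and does not contact a Solr server.
    """
    params = [
        ("q", term),
        ("fq", filter_query),  # the filter that hides documents containing the term
        ("spellcheck", "true"),
        ("spellcheck.extendedResults", "true"),
        # A value > 0 requests suggestions for terms that already exist in the index.
        ("spellcheck.alternativeTermCount", str(alternative_term_count)),
    ]
    return urlencode(params)


# "hydrauhc" is one of the OCR misspellings that exists in the corpus.
query_string = build_spellcheck_query("hydrauhc", "collection:manuals")

# Host, core name, and handler path below are placeholders.
print("http://localhost:8983/solr/mycore/select?" + query_string)
```

The value 5 is arbitrary; per the rewording above, the point is only that it must be greater than 0.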