Solr analyzers and order of tokenizers and filters


Debugging Solr filters is difficult because you can't see the intermediate results. From a test it appears that the analyzer always runs the tokenizer first and then the filters, regardless of their order in the XML.

Here is the configuration that raised the suspicion:

      <!-- all to lower case -->
      <filter class="solr.LowerCaseFilterFactory"/>
      <!-- first convert all to ASCII -->
      <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
      <!-- all punctuation replaced by nothing -->
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z0-9\s]+)" replacement=""  replace="all"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>

The idea is that a name like Ying-yang would collapse to yingyang, and we could then search for that if we wanted. However, this doesn't work with the StandardTokenizerFactory (we get no results searching for yingyang), but it does work if we use the KeywordTokenizer instead, which suggests the dash is causing a token split. The regular expression should have removed the dash, and the fact that the chain works with the KeywordTokenizer proves the regex itself is fine.
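
For reference, the variant that does work is the same chain with only the tokenizer swapped (a sketch of the working configuration):

      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
      <filter class="solr.PatternReplaceFilterFactory" pattern="([^a-z0-9\s]+)" replacement=""  replace="all"/>
      <!-- KeywordTokenizer emits the whole input as a single token, so the dash never splits the name and the PatternReplaceFilter can strip it -->
      <tokenizer class="solr.KeywordTokenizerFactory"/>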

So, is anyone aware whether it is a limitation of analyzers in Solr that the tokenizer must run first? All examples online show a tokenizer first, so I don't know if anyone has tried filtering before tokenizing.


Solution

  • Your observation is correct - the tokenizer always runs before the filters, but CharFilters run even before that.

    You can use a PatternReplaceCharFilterFactory to run your replacement before the tokenizer sees your string:

        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="[^a-zA-Z0-9\s]" replacement="" />
        <tokenizer ...>

    Note that the character class has to keep uppercase letters (A-Z) here: the char filter runs before your LowerCaseFilterFactory, so a pattern of [^a-z0-9\s] would strip every capital letter before it could be lowercased.
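
    Put together, a minimal fieldType sketch might look like this (the name text_punctfree is made up for illustration; the Unicode-aware class \p{L}\p{N} keeps accented letters in place so the ASCIIFoldingFilter can still fold them):

        <fieldType name="text_punctfree" class="solr.TextField" positionIncrementGap="100">
          <analyzer>
            <!-- char filters run first, on the raw string -->
            <charFilter class="solr.PatternReplaceCharFilterFactory"
                        pattern="[^\p{L}\p{N}\s]" replacement="" />
            <!-- the tokenizer runs next -->
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <!-- token filters run last, in the order listed -->
            <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false" />
            <filter class="solr.LowerCaseFilterFactory"/>
          </analyzer>
        </fieldType>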
    

    And your initial assumption is wrong (i.e. "Debugging Solr filters is difficult because you can't see the intermediate results."). If you go to your core / collection in the Solr Admin UI and select the "Analysis" link in the collection menu, you'll get a dropdown of all your defined fields. Enter the text you want to index on the left side and the query you expect the user to type on the right side, and you'll see the tokens generated at each step of the chain - exactly how the text is processed by any char filters, the tokenizer and every following filter.
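
    The same information is also available over HTTP via the field analysis handler, which is handy for scripted checks (a sketch; the host, collection and field names are placeholders):

        http://localhost:8983/solr/mycollection/analysis/field?analysis.fieldname=name&analysis.fieldvalue=Ying-yang&analysis.query=yingyang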

    In your case the WhitespaceTokenizer might be better suited than the StandardTokenizer, but that will also mean that searching for just "Ying" won't give you a hit when the indexed name is "Ying-yang". In that case you can define multiple fields with different analysis chains and use a copyField instruction to copy the same content into each of them. You can then use qf (with the edismax handler) to search across those fields and apply different weights based on how exact you consider each field to be (i.e. give more weight to a hit in an exact field than to one in a field with a StandardTokenizer); see the sketch below.
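
    A minimal sketch of that setup (the field and type names name_exact, name_loose, text_exact and text_loose are made up; text_exact would use the stricter chain and text_loose the StandardTokenizer chain):

        <!-- exact variant: the whole name collapses to a single token -->
        <field name="name_exact" type="text_exact" indexed="true" stored="true"/>
        <!-- loose variant: split into words, so "Ying" alone also matches -->
        <field name="name_loose" type="text_loose" indexed="true" stored="false"/>
        <copyField source="name_exact" dest="name_loose"/>

    At query time, weight the exact field higher:

        q=yingyang&defType=edismax&qf=name_exact^5 name_loose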