Tags: solr, lucene, tokenize, protected-words

solr not tokenizing protected words


I have documents in Solr/Lucene (3.x) with a special copy field, facet_headline, which keeps an unstemmed version of the headline for faceting.

Sometimes two or more words belong together and should be handled/counted as a single token, for example "kim jong il".

So the headline "Saturday: kim jong il had died" should be tokenized into:

Saturday | kim jong il | had | died

For this reason I decided to use protected words (protwords.txt), to which I added kim jong il. The schema.xml looks like this:

   <fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
      <analyzer>
         <tokenizer class="solr.PatternTokenizerFactory" pattern="\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
         <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0"
                 protected="protwords.txt" />
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.TrimFilterFactory"/>
         <filter class="solr.StopFilterFactory"
                 ignoreCase="true"
                 words="stopwords.txt"
                 enablePositionIncrements="true" />
      </analyzer>
   </fieldType>

Using the Solr analysis page, it looks like this doesn't work: the string is still split into 6 words, as if protwords.txt were not being used. However, if the headline contains ONLY the name kim jong il, everything works fine and the term is not split.

Is there a way to achieve my goal of not splitting specific words/word groups?


Solution

  • After searching the web, I came to the conclusion that this goal cannot be reached with this analyzer chain. The likely reason: the protected list of WordDelimiterFilterFactory is matched against individual incoming tokens, and since the PatternTokenizerFactory pattern contains no whitespace, the whole headline arrives as a single token that does not match the protwords.txt entry and therefore gets split. Only a field containing exactly kim jong il produces a token that matches the protected entry, which explains the observed behavior. Keeping multi-word phrases together is simply not what these tokenizers and filters are designed for.
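One workaround that is sometimes used for this kind of faceting problem is an index-time synonym mapping that collapses the phrase into a single token before any splitting can occur. The sketch below is illustrative, not a verified solution for this exact schema: the file name synonyms.txt and the merged token kim_jong_il are assumptions, and the analyzer chain is simplified to show only the idea.

```xml
<!-- Sketch: collapse known multi-word names into one token at index time.
     synonyms.txt (hypothetical) would contain the line:
         kim jong il => kim_jong_il
     so the three consecutive tokens are replaced by a single token. -->
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
   <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
              ignoreCase="true" expand="false"/>
   </analyzer>
</fieldType>
```

With a mapping like this, the headline "Saturday: kim jong il had died" would facet on kim_jong_il as one value instead of three separate words; the trade-off is that every protected phrase has to be listed in the synonyms file in advance.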