I have documents in Solr/Lucene (3.x) with a special copy field, facet_headline, so that I have an unstemmed field for faceting.
Sometimes two or more words belong together and should be handled and counted as one term, for example "kim jong il".
So the headline "Saturday: kim jong il had died" should be split into:
Saturday
kim jong il
had
died
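To make the goal concrete, here is a minimal sketch in plain Python (not Solr code) of the splitting I am after. The function name `split_headline` and the `PROTECTED_PHRASES` list are hypothetical and only stand in for the entries I would put into protwords.txt; it says nothing about how Solr itself processes the field.

```python
import re

# Hypothetical phrase list standing in for the entries in protwords.txt.
PROTECTED_PHRASES = ["kim jong il"]


def split_headline(headline, phrases=PROTECTED_PHRASES):
    """Split a headline into words, keeping protected phrases as one token."""
    # Try longer phrases first so a long phrase wins over a shorter one
    # that starts with the same words.
    ordered = sorted(phrases, key=len, reverse=True)
    # Each phrase must match on word boundaries, literally.
    alternatives = [r"\b" + re.escape(p) + r"\b" for p in ordered]
    # Anything that is not part of a protected phrase falls back to single words.
    alternatives.append(r"\w+")
    pattern = re.compile("|".join(alternatives), re.IGNORECASE)
    # Punctuation and whitespace between matches are simply skipped.
    return pattern.findall(headline)


print(split_headline("Saturday: kim jong il had died"))
```

The point is only that the phrase lookup has to happen before (or instead of) the whitespace split, which is the behavior I am trying to get out of the analyzer chain.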
For this reason I decided to use protected words (protwords.txt), to which I added kim jong il.
The schema.xml looks like this:
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- Split on punctuation only; quote and angle brackets are escaped as XML entities -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\?|\!|\.|\:|\;|\,|\&quot;|\(|\)|\\|\+|\*|&lt;|&gt;|([0-31]+\.)" />
    <filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="0"
            protected="protwords.txt" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
    />
  </analyzer>
</fieldType>
According to the Solr analysis page, this doesn't work!
The string is still split into 6 words. It looks as if protwords.txt is not used. However, if the headline contains ONLY the name, kim jong il,
everything works fine and the terms aren't split.
Is there a way to reach my goal: not to split specific words/word groups?
After searching the web I came to the conclusion that it is not possible to reach this goal. It looks like this is not something the tokenizers and filters are designed for.