I'm using Lucene with Solr to index some documents (news articles). Each document also has a HEADLINE field.
Now I'm trying to run a facet search over the HEADLINE field to find the terms with the highest counts.
All of this works without a problem, including a stopword list.
The HEADLINE field is a multivalued field. I use solr.StandardTokenizerFactory
to split the field into single terms (I know this is not best practice, but it's the only way and it works).
Sometimes the tokenizer splits terms that shouldn't be split, like 9/11
(which is split into 9 and 11). So I decided to use a "protwords" list, and "9/11" is part of that list. But nothing changed.
Here is the relevant part of my schema.xml:
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory" protected="protwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"
            protected="protwords.txt"/>
  </analyzer>
</fieldType>
Looking at the facet results, I see a lot of documents dealing with "9/11" grouped (faceted) under "9" or "11", but never under "9/11".
Why does this not work?
Thank you.
The final solution to this problem was to switch to solr.PatternTokenizerFactory.
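
For anyone landing here, this is a minimal sketch of what such a field type could look like. The exact pattern is an assumption (splitting on runs of whitespace only), not necessarily the one used in the real schema. As far as I can tell, "protected" is an option on filters such as the stemmers and the word delimiter filter, not on the tokenizer or the stop filter, so it is left out here.

```xml
<fieldType name="facet_headline" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- Assumed pattern: split on whitespace only, so "9/11" stays one token.
         group="-1" tells the tokenizer to use the pattern as a delimiter. -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+" group="-1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
    <!-- Stopwords are still removed after tokenizing. -->
    <filter class="solr.StopFilterFactory"
            ignoreCase="true"
            words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

With a whitespace-only pattern, punctuation stays attached to terms (for example "9/11,"), so the pattern may need extra delimiter characters depending on the headlines. The field has to be reindexed after the change before the facet counts reflect it.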