
Lucene Analyzer chain: ShingleFilter without filler tokens


In my analyzer chain, ShingleFilter comes after the stopword filter. As mentioned in the docs, ShingleFilter handles position increments > 1 by inserting filler tokens (tokens with the term text "_").

For example: "please divide this sentence into biword shingles"

Shingles of size 2: please divide, divide _, _ sentence, sentence _, _ biword, biword shingles (assuming that "this" and "into" are stopwords)

I would like to eliminate those shingles with the filler tokens, i.e. my desired output contains only: please divide, biword shingles.

I have a dedicated field for facets with shingles up to 4-grams. Because of the stopwords, the facet constraints (or values) are cluttered with useless entries containing fillers, like "divide _ sentence _".

Could you please guide me?

Using Solr 4.4.

UPDATE

I thought of setting enablePositionIncrements to false in the StopFilter configuration. I'm not sure whether that would solve the problem, but in any case Lucene 4.4 no longer supports it.


Solution

  • Add a PatternReplaceFilterFactory to your analyzer chain after ShingleFilterFactory, and replace every token that contains the filler token with an empty string, i.e. "".

    This may solve your problem temporarily, but for a permanent solution you will have to write your own analyzer or customize ShingleFilter.

    Sample FieldType:

    <fieldType name="text_general_shingle" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
            <filter class="solr.LowerCaseFilterFactory"/>
            <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
            <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
        </analyzer>
    </fieldType>
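
One thing to watch with the pattern-replace approach: PatternReplaceFilter only rewrites the term text, so a shingle that matched the pattern stays in the stream as an empty token and could still show up as a blank facet value. A possible variation, sketched here as an assumption rather than something tested against Solr 4.4, is to follow it with a LengthFilterFactory that drops zero-length tokens. The field type name and the max value of 255 are illustrative choices, not taken from the question.

```xml
<!-- Sketch only: the name "text_shingle_no_filler" and max="255" are assumed values. -->
<fieldType name="text_shingle_no_filler" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Up to 4-grams, matching the facet field described in the question -->
        <filter class="solr.ShingleFilterFactory" maxShingleSize="4" outputUnigrams="true"/>
        <!-- Blank out every shingle that contains the filler token "_" -->
        <filter class="solr.PatternReplaceFilterFactory" pattern=".*_.*" replacement=""/>
        <!-- Remove the tokens that were blanked out above (length 0) -->
        <filter class="solr.LengthFilterFactory" min="1" max="255"/>
    </analyzer>
</fieldType>
```

Note that the pattern `.*_.*` also matches any genuine token that contains an underscore (StandardTokenizer may keep something like "foo_bar" as a single token), so those would be removed as well. If that matters for your data, a stricter pattern that only matches "_" as a whole word inside the shingle would be safer.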