java solr solrnet

Sorting a field that contains special characters in Solr


I am new to Solr indexing. I want to sort on a location field that holds a variety of values. Some of the values start with special characters or prefixes, for example 'sAmerica, #'Japan, and %India.

When I sort on this field, I do not want special characters such as 's, '#, !, and ~ to be considered. I want a sort that ignores these characters and returns results such as America in 1st position, %India in 2nd, and #'Japan in 3rd.

How can I make this possible? I am using PatternReplaceFilterFactory, but I am not sure how to use it.

  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1"  />
    <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement="" replace="all" />
  </analyzer>
</fieldType>


Solution

  • If you want to ignore the special characters, try the following field type.
    It lowercases the value and catenates the word parts, dropping all special characters.

        <fieldType name="string_sort" class="solr.TextField" positionIncrementGap="1">
            <analyzer type="index">
                <tokenizer class="solr.KeywordTokenizerFactory" />
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" />
            </analyzer>
        </fieldType>
    

    However, this will not work for 'sAmerica, because s is not a special character.

    <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement="" replace="all" />
    

    If this is a fixed pattern, you need to remove it with the filter above, placed before the word delimiter filter.

    Edit: Are you using this config?

    <fieldType name="string_sort" class="solr.TextField" positionIncrementGap="1">
        <analyzer type="index">
            <tokenizer class="solr.KeywordTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
            <filter class="solr.PatternReplaceFilterFactory" pattern="'s" replacement="" replace="all" />
            <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" />
        </analyzer>
    </fieldType>
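
    To sort with this type, the field has to be populated at index time, since sorting works on the indexed terms rather than on the output of a query analyzer. A minimal sketch, assuming the original field is called location and introducing a hypothetical location_sort field (both names are illustrative and do not come from the question):

    ```xml
    <!-- Hypothetical field names; adjust them to your schema.xml -->
    <field name="location_sort" type="string_sort" indexed="true" stored="false" />
    <!-- Copy the raw value so the original field keeps its display form -->
    <copyField source="location" dest="location_sort" />
    ```

    After reindexing, a request with sort=location_sort asc should order documents by the cleaned value while still returning the original location text.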
    

    I tested the config above on the analysis page with the value 'sAlgarve, and it produces the following tokens:

    KT (KeywordTokenizer) - 'sAlgarve
    LCF (LowerCaseFilter) - 'salgarve
    PRF (PatternReplaceFilter) - algarve
    WDF (WordDelimiterFilter) - algarve

    Can you check your field through the analysis page?
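
    If a value can contain 's somewhere other than the start (for example king's), or contains spaces, a variant worth trying is to anchor the prefix pattern and strip the remaining characters with a second regex instead of the word delimiter, so the field always ends up as a single token. This is only a sketch: the type name string_sort_regex is made up, and the patterns assume that only ASCII letters and digits should take part in sorting.

    ```xml
    <fieldType name="string_sort_regex" class="solr.TextField" positionIncrementGap="1">
        <analyzer>
            <tokenizer class="solr.KeywordTokenizerFactory" />
            <filter class="solr.LowerCaseFilterFactory" />
            <!-- Remove the 's prefix only at the start of the value -->
            <filter class="solr.PatternReplaceFilterFactory" pattern="^'s" replacement="" replace="first" />
            <!-- Drop everything that is not a lowercase letter or digit (spaces included) -->
            <filter class="solr.PatternReplaceFilterFactory" pattern="[^a-z0-9]" replacement="" replace="all" />
        </analyzer>
    </fieldType>
    ```

    With these patterns, 'sAmerica should become america, %India should become india, and #'Japan should become japan. The field needs to be reindexed after the type is changed.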