I want to ignore special characters at query time in Solr. For example, let's assume we have a document in Solr with content:My name is A-B-C.
content:A-B-C returns documents, but content:ABC doesn't return any document.
My requirement is that content:ABC should return that one document. So basically, I want to ignore the - at query time.
To get the tokens concatenated when they have a special character between them (i.e. A-B-C should match ABC and not just A), you can use a PatternReplaceCharFilter. This will allow you to replace all those characters with an empty string, effectively giving ABC to the next step of the analysis process instead.
<analyzer>
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="[^a-zA-Z0-9 ]" replacement=""/>
  <tokenizer ...>
  [...]
</analyzer>
This will keep all regular ASCII letters, digits and spaces, while replacing any other character with the empty string. You'll probably have to tweak that character group to include more (accented letters, for instance, are removed as well), but that will depend on your raw content and how it should be processed.
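To see what that pattern does to the text before it reaches the tokenizer, here is a minimal sketch in plain Java. It only mimics the replacement step with java.util.regex and does not run Solr itself; the class and method names (CharFilterSketch, strip) are made up for illustration.

```java
import java.util.regex.Pattern;

public class CharFilterSketch {
    // Same character class as the pattern attribute in the config above.
    private static final Pattern NOT_KEPT = Pattern.compile("[^a-zA-Z0-9 ]");

    // Hypothetical helper: removes every character the pattern matches,
    // which is roughly what the char filter does before tokenization.
    static String strip(String input) {
        return NOT_KEPT.matcher(input).replaceAll("");
    }

    public static void main(String[] args) {
        System.out.println(strip("My name is A-B-C")); // hyphens are removed
        System.out.println(strip("ABC"));              // already clean, unchanged
        System.out.println(strip("Zo\u00eb"));         // the accented letter is removed too
    }
}
```

The last call is the reason the character group may need tweaking: anything outside a-z, A-Z, 0-9 and space is dropped, including non-ASCII letters.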
This should be done both when indexing and when querying (as long as you want the user to be able to query for A-B-C as well). If you want to score these matches differently, use multiple fields with different analysis chains, for example keeping one field that only tokenizes on whitespace and boosting it higher (with qf=text_ws^5 other_field) if you have a match on A-B-C.
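As a rough illustration of why the two-field setup lets you rank exact A-B-C matches higher, the following plain-Java sketch imitates the two analysis chains with simple string operations. It is a simplification under stated assumptions, not Solr's actual analysis or scoring: the class FieldMatchSketch and its methods are hypothetical, and real chains would usually also lowercase and apply other filters.

```java
import java.util.Arrays;
import java.util.List;

public class FieldMatchSketch {
    // Hypothetical "stripped" field: drop special characters, then split on whitespace.
    static List<String> strippedTokens(String text) {
        String cleaned = text.replaceAll("[^a-zA-Z0-9 ]", "");
        return Arrays.asList(cleaned.trim().split("\\s+"));
    }

    // Hypothetical whitespace-only field (in the spirit of text_ws): hyphens are kept.
    static List<String> whitespaceTokens(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // The same analysis is applied to the document and to the query,
    // mirroring "do it both when indexing and when querying".
    static boolean matchesStripped(String doc, String query) {
        return strippedTokens(doc).containsAll(strippedTokens(query));
    }

    static boolean matchesWhitespace(String doc, String query) {
        return whitespaceTokens(doc).containsAll(whitespaceTokens(query));
    }

    public static void main(String[] args) {
        String doc = "My name is A-B-C";
        for (String q : new String[] {"ABC", "A-B-C"}) {
            System.out.println(q + " stripped=" + matchesStripped(doc, q)
                    + " whitespace=" + matchesWhitespace(doc, q));
        }
    }
}
```

In this sketch a query for ABC matches only the stripped field, while A-B-C matches both, so a boost on the whitespace field would only lift documents that contain the hyphenated form.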
This does not change what content is actually stored for the field, so the data returned will still be the same - only how a match is performed.