I'm using Solr with Sunspot/dismax. Is it possible to query for non-alphabetic characters? I.e.:
~ ! @ # $ % ^ & * ( ) _ + - = [ ] { } | \
I'm aware that +/- must be escaped, as they are dismax inclusion/exclusion operators. But I'm getting no matches when I search for any of these characters:
Foo.search { fulltext '=' }.results.length # => 0
Foo.search { fulltext '\=' }.results.length # => 0
Yet:
Foo.search { fulltext 'a' }.results.length # => 30
Here is the tokenizer config I'm using:
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
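Solr's StandardTokenizer drops punctuation and other 'special characters', since it is designed for plain text. So '=', for example, won't be found: it is stripped from the text during indexing. Because your single analyzer is applied to queries as well, it is also stripped from the search term, which leaves nothing to match.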
One tokenizer that preserves all characters is WhitespaceTokenizer, which splits input only on whitespace. You need to evaluate whether it is a good solution to your problem, as it produces tokens like this:
20-year-old fox jumps over the lazy dog. -> '20-year-old', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog.'
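As a minimal sketch of that option, the field type from the question could swap in the whitespace tokenizer as shown below. The name text_ws is a hypothetical one chosen for illustration, and StandardFilterFactory is left out on the assumption that it is only useful after StandardTokenizer. Existing documents would have to be reindexed before the change has any effect.

```xml
<!-- "text_ws" is an illustrative name, not something from the original schema -->
<fieldType name="text_ws" class="solr.TextField" omitNorms="false">
  <analyzer>
    <!-- Splits on whitespace only, so tokens such as '=' or 'dog.' are kept as-is -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this analyzer a search for '=' can only match where '=' appears as a standalone, whitespace-separated token; 'a=b' is indexed as the single token 'a=b'.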
You may need to provide your own tokenizer (not necessarily by implementing one: you can define an appropriate regular expression for the split characters and use PatternTokenizer), or use a filter such as WordDelimiterFilter or PatternReplaceFilter.
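A rough sketch of the regular-expression route, under assumptions of my own: the field type name text_symbols, the choice of split characters (whitespace, commas and semicolons) and the rule for trailing periods are all illustrative and would need adjusting to your data.

```xml
<!-- "text_symbols" and both patterns are illustrative assumptions -->
<fieldType name="text_symbols" class="solr.TextField" omitNorms="false">
  <analyzer>
    <!-- group="-1" treats the pattern as a delimiter: split on runs of whitespace, commas and semicolons -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s,;]+" group="-1"/>
    <!-- Remove a period that directly follows a word character at the end of a token, so 'dog.' becomes 'dog' -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="(\w)\.$" replacement="$1" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Characters that are not listed as delimiters, such as '=' or '#', stay in the tokens and become searchable. Characters with special meaning to the query parser still need escaping on the query side, as you already do for +/-.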