The question below relates to a Django application (1.8.8) using SOLR (4.9.0) and Haystack.
The data I need to search contains various strings such as "A1234" and "ABCDE1"; these strings will turn up in both "text" and "name" fields defined as follows:
name = indexes.CharField(indexed=True, model_attr="name")
text = indexes.EdgeNgramField(document=True, use_template=True)
Searching the text field for one of the above strings finds nothing, whereas searching the name field works without problems. If I omit the letter when searching the text field (e.g. I search for "1234"), I can find what I'm looking for.
Querying the SOLR server directly with debugging enabled shows that these strings are split:
// text field - no hits
rawquerystring: "A1234",
querystring: "A1234",
parsedquery: "+text:a +text:1234",
parsedquery_toString: "+text:a +text:1234",
explain: { },
QParser: "LuceneQParser",
// name field - finds the correct records
rawquerystring: "name:A1234",
querystring: "name:A1234",
parsedquery: "name:a1234",
parsedquery_toString: "name:a1234",
explain: { },
QParser: "LuceneQParser",
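For anyone wanting to reproduce the debug output above, here is a minimal sketch of building a request URL that asks Solr for query-parsing details via the standard debugQuery parameter. The base URL and core name are hypothetical placeholders (they are not given in the question), and build_debug_url is just an illustrative helper name.

```python
from urllib.parse import urlencode

# Hypothetical base URL: substitute the real host and core name.
SOLR_SELECT_URL = "http://localhost:8983/solr/collection1/select"


def build_debug_url(query, base_url=SOLR_SELECT_URL):
    """Return a select URL that asks Solr to include parsed-query debug info."""
    params = {
        "q": query,            # the raw query string, e.g. "A1234" or "name:A1234"
        "wt": "json",          # JSON response, as in the output shown above
        "debugQuery": "true",  # adds rawquerystring, parsedquery, etc.
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_debug_url("A1234"))
    print(build_debug_url("name:A1234"))
```

Opening either printed URL in a browser (or fetching it with any HTTP client) should return the debug section containing parsedquery for that query.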
The section of schema.xml relating to edge_ngram fields (the text field above being such) is as follows:
<fieldType name="edge_ngram" class="solr.TextField" positionIncrementGap="1">
<analyzer type="index">
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" splitOnNumerics="0"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
</analyzer>
</fieldType>
So, is there any means of preventing these strings from being split? I would have thought that the splitOnNumerics="0" option would have sorted the problem out (as suggested in "Solr: Can't search for numbers mixed with characters"), but it appears that it cannot be applied to a solr.EdgeNGramFilterFactory. I used this latter factory because it got around another problem: a search for "foo bar" would not find "foobar.com" in the text field, and users will be running this sort of search and expecting a hit.
Does anyone have any suggestions for fixing this?
Finally found it. The query analyzer of the edge_ngram field type contained this:
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
Modifying the WordDelimiterFilterFactory to set generateNumberParts="0" did the trick whilst preserving the other requirements for this field.
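To make the effect of that flag easier to picture, here is a toy Python model of the letter/digit split seen in the debug output. It is a deliberately simplified sketch and not Solr's actual implementation: it only lowercases the token, splits at letter/digit boundaries, and drops word or number parts according to two flags named after the schema attributes. The function name split_token is made up for illustration, and how Solr then matches the remaining token against the indexed n-grams is not modelled.

```python
import re


def split_token(token, generate_word_parts=True, generate_number_parts=True):
    """Toy model of a letter/digit split; NOT Solr's real algorithm."""
    # Lowercase first, mirroring the LowerCaseFilterFactory placed before the
    # word delimiter filter, then break into runs of letters or digits.
    parts = re.findall(r"[a-z]+|[0-9]+", token.lower())
    kept = []
    for part in parts:
        if part.isdigit():
            # Number parts are only emitted when the flag is on.
            if generate_number_parts:
                kept.append(part)
        elif generate_word_parts:
            kept.append(part)
    return kept


if __name__ == "__main__":
    # With both flags on, the token splits as in the parsedquery above.
    print(split_token("A1234"))
    # With number parts off, the numeric piece is no longer emitted.
    print(split_token("A1234", generate_number_parts=False))
```

Under this simplified model, the first call yields the same two pieces that appear in parsedquery for the text field, and switching off number parts leaves only the alphabetic piece in the query.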