In Solr (3.3), is it possible to make a field searchable letter by letter through an EdgeNGramFilterFactory
and also sensitive to phrase queries?
For example, I'm looking for a field that, if it contains "contrat informatique", will be found if the user types:
Currently, I have something like this:
<fieldtype name="terms" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
    <tokenizer class="solr.LowerCaseTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
  </analyzer>
</fieldtype>
...but it fails on phrase queries.
When I look at the schema analyzer in the Solr admin, I see that "contrat informatique" generates the following tokens:
[...] contr contra contrat in inf info infor inform [...]
So the query works with "contrat in" (consecutive tokens), but not with "contrat inf" (because these two tokens are separated by the "in" token).
I'm pretty sure any kind of stemming can work with phrase queries, but I cannot find the right tokenizer or filter to use before the EdgeNGramFilterFactory.
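To make the position problem concrete, here is a rough Python sketch that mimics the behaviour seen in the analysis output. It is not Solr code: the helpers `edge_ngrams` and `phrase_matches` are hypothetical, and the sketch assumes that every gram gets its own position and that a phrase query requires its terms at consecutive positions.

```python
def edge_ngrams(text, min_gram=2, max_gram=15):
    """Return (position, gram) pairs, assuming each gram takes its own position."""
    tokens = []
    position = 0
    for word in text.lower().split():
        for size in range(min_gram, min(len(word), max_gram) + 1):
            position += 1
            tokens.append((position, word[:size]))
    return tokens


def phrase_matches(indexed, phrase):
    """True if the phrase terms appear at consecutive positions in the index."""
    terms = phrase.lower().split()
    positions = {}
    for pos, gram in indexed:
        positions.setdefault(gram, []).append(pos)
    return any(
        all(start + i in positions.get(term, []) for i, term in enumerate(terms))
        for start in positions.get(terms[0], [])
    )


indexed = edge_ngrams("contrat informatique")
print(phrase_matches(indexed, "contrat in"))   # True: "in" directly follows "contrat"
print(phrase_matches(indexed, "contrat inf"))  # False: "in" sits between the two
```

Under these assumptions "contrat" lands at position 6, "in" at 7 and "inf" at 8, which is why only the first phrase matches.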
Alas, I could not get PositionFilter to work the way Jayendra Patil suggested (PositionFilter turns any query into a boolean OR query), so I used a different approach.
Still using the EdgeNGramFilter, I made every keyword the user types mandatory and disabled phrase queries altogether.
So if the user asks for "cont info", the query is transformed into +cont +info. It's a bit more permissive than a true phrase query would be, but it does what I want (and doesn't return results that contain only one of the two terms).
The only drawback of this workaround is that the terms can match in any order (so a document with "informatique contrat" will also be found), but that's not a big deal.
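The query rewriting described above could look roughly like this on the client side. This is only a sketch: `to_mandatory_terms` is a hypothetical helper name, the code is not tied to any Solr client library, and escaping of other query-syntax characters is left out.

```python
def to_mandatory_terms(user_input):
    """Drop phrase quotes and prefix every keyword with '+' to make it mandatory."""
    keywords = user_input.replace('"', " ").split()
    return " ".join("+" + keyword for keyword in keywords)


print(to_mandatory_terms('"cont info"'))  # +cont +info
```

The resulting string would then be sent as the query, so both "cont" and "info" must match some edge n-gram in the field, in any order.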