I have defined two dynamic fields in my Solr 5 schema:
<dynamicField name="*_texts_en" stored="true" type="text_en" multiValued="true" indexed="true"/>
<dynamicField name="*_texts_pt" stored="true" type="text_pt" multiValued="true" indexed="true"/>
for documents in English and in Portuguese, with the following index and query analyzers:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="text_pt" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" format="snowball" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PortugueseLightStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" format="snowball" />
<filter class="solr.LowerCaseFilterFactory"/>
<!-- <filter class="solr.BrazilianStemFilterFactory"/> -->
<filter class="solr.PortugueseLightStemFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
A document can be in either Portuguese or English. An English document uses a field such as 'body_texts_en'; a Portuguese one uses 'body_texts_pt'.
However, I am running into problems when a search queries both fields at once and solr.StopFilterFactory is in the filter chain. When I search without knowing the language of the query, I query Solr like this (the response below echoes the request parameters):
{
"responseHeader": {
"status": 0,
"QTime": 1,
"params": {
"q": "suco de limão",
"defType": "edismax",
"indent": "true",
"qf": " body_texts_pt body_texts_en",
"wt": "json",
"lowercaseOperators": "true",
"stopwords": "true",
"_": "1430434475811"
}
},
"response": {
"numFound": 0,
"start": 0,
"docs": []
}
}
The query above was done using terms in Portuguese. Even though the index had matching documents, no results are returned. On the other hand, as soon as I:
remove 'body_texts_en' from 'qf' param (in the solr request), OR
remove all solr.StopFilterFactory filters from all analyzers,
the matching documents are correctly returned.
So the problem seems to lie in combining solr.StopFilterFactory with a simultaneous query against two fields, each of which has its own stop word list (as shown above).
Is there any way to make the query above work as expected?
Thanks in advance.
EDIT (Ruby function I wrote based on the response of @frances for his solution number 2):
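# Builds a per-language boolean query for Solr (solution 2 in the answer).
# Assumes the I18n gem is loaded, e.g. inside a Rails app.
def multiple_language_query_solr(q)
  fields = { 'title' => 2, 'body' => 1 }
  terms = q.split
  query = []
  I18n.available_locales.each do |locale|
    lang = locale.to_s.split('-').first # e.g. :"pt-BR" becomes "pt"
    fields.each do |field, boost|
      name = "#{field}_texts_#{lang}"
      sentence = terms.map { |word| "#{name}:#{word}" }.join(' AND ')
      # A boost follows the clause it applies to: (...)^2, not field^2:term.
      query << "(#{sentence})" + (boost > 1 ? "^#{boost}" : '')
    end
  end
  query.join(' OR ')
end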
With best regards, Eric
This may or may not be the issue in your case, but I think I know what's happening here. You didn't specify your mm (Minimum Should Match) value, which I suspect is set to at least "3" or "70%". (As an aside, if you add the argument echoParams=all to your Solr query in the future, the parameters set in your solrconfig.xml that are active in the search will also be returned, giving a more complete picture of the search.)
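As a rough illustration of the aside above, here is a minimal Ruby sketch that builds such a request URL. The host, port and core name (`localhost:8983`, `mycore`) and the helper name `debug_query_url` are assumptions made up for this example; `echoParams` and `debugQuery` are standard Solr request parameters.

```ruby
require 'uri'

# Hypothetical Solr location; adjust host, port and core name to your setup.
SOLR_SELECT_URL = 'http://localhost:8983/solr/mycore/select'

# Returns a select URL that asks Solr to echo all active parameters.
def debug_query_url(q, qf)
  params = {
    'q' => q,
    'defType' => 'edismax',
    'qf' => qf.join(' '),
    'wt' => 'json',
    'echoParams' => 'all', # also echo defaults coming from solrconfig.xml
    'debugQuery' => 'true' # include the parsed query in the response
  }
  "#{SOLR_SELECT_URL}?#{URI.encode_www_form(params)}"
end

puts debug_query_url('suco de limão', %w[body_texts_pt body_texts_en])
```

If mm is set in a request handler default, it should then show up in the "params" section of the response header.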
When you search only the Portuguese text field, the query parser expands your query like this:
( body_texts_pt:suco ) ( body_texts_pt:limão )
Because "de" is in your Portuguese stopword filter, it is eliminated from your search entirely and two out of two (100%) of your remaining terms match. When your search uses both fields, it will be expanded like this:
( body_texts_pt:suco | body_texts_en:suco ) ( body_texts_en:de ) ( body_texts_pt:limão | body_texts_en:limão )
This time "de" was not eliminated from all of the fields in your search, so it remains a term in your query. But because it was eliminated from the search of the Portuguese text, it can only match against the English text. The result: two out of three (~67%) of your terms match instead of two out of two. If your mm value is strict, then (with apologies to Meat Loaf) two out of three may not cut it.
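The counting argument can be reproduced with a toy model. This is only a sketch of the reasoning, not how Solr implements edismax: the stop word sets are tiny stand-ins for the real lang/stopwords_*.txt files, stemming is ignored, and the helper names `clauses` and `matched` are made up for illustration.

```ruby
require 'set'

# Toy stop word lists (assumption), not the real Solr stopword files.
STOPWORDS = {
  'body_texts_pt' => Set['de', 'a', 'o'],
  'body_texts_en' => Set['the', 'of', 'a']
}.freeze

# One clause per query term that survives stop filtering in at least one
# field. Each clause records the fields in which the term is still searched.
def clauses(query, fields)
  query.downcase.split.filter_map do |term|
    live = fields.reject { |f| STOPWORDS.fetch(f).include?(term) }
    [term, live] unless live.empty?
  end
end

# Counts the clauses satisfied by a document given as { field => [tokens] }.
def matched(doc, query, fields)
  clauses(query, fields).count do |term, live|
    live.any? { |f| doc.fetch(f, []).include?(term) }
  end
end

# A Portuguese document; "de" was already dropped at index time.
doc = { 'body_texts_pt' => %w[suco limão] }
q = 'suco de limão'

pt_only = %w[body_texts_pt]
both = %w[body_texts_pt body_texts_en]

puts "#{matched(doc, q, pt_only)} of #{clauses(q, pt_only).size}" # 2 of 2
puts "#{matched(doc, q, both)} of #{clauses(q, both).size}"       # 2 of 3
```

With only the Portuguese field, every remaining clause matches. Adding the English field brings "de" back as a third clause that the Portuguese document can never satisfy, so a setting that requires all clauses to match rejects the document.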
The Solution?
1. Turn off stop word filtering (easy solution - recommended)
This problem is completely resolved when the stop word filter configurations match in all of your searched fields. Since you wouldn't be able to apply a sensible unified set of stop words across the English and Portuguese fields, that leaves not using stop words at all. Stop word filtering often doesn't make as much difference to the efficiency of your index as one might imagine. I would suggest rebuilding your index with all stop word filtering deactivated, to see if this makes a noticeable difference in speed.
2. Pre-process the query string (more complicated)
You're using the Extended Dismax Query Parser (edismax). The main difference between this and the Dismax Query Parser (dismax) is support for logical/boolean queries. If you expand the query yourself, you can create a logical structure that works for you. For the search: suco de limão, the preprocessed search that actually gets sent to Solr might be:
(body_texts_pt:suco AND body_texts_pt:de AND body_texts_pt:limão) OR
(body_texts_en:suco AND body_texts_en:de AND body_texts_en:limão)
With this query, the term body_texts_pt:de is eliminated by the stop word filter, so either the words "suco" and "limão" must match against the Portuguese text, or the words "suco" and "de" and "limão" must match against the English text.
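A minimal, self-contained Ruby sketch of that pre-processing step could look like the following. The helper name `per_field_query` is made up for this example, and it assumes plain whitespace-separated terms: Lucene special characters such as ":" or "(" in user input are not escaped here and would need handling in real use.

```ruby
# Builds one AND-group per language field and ORs the groups together.
# Assumption: terms contain no Lucene query syntax characters.
def per_field_query(q, fields)
  terms = q.split
  groups = fields.map do |field|
    "(#{terms.map { |t| "#{field}:#{t}" }.join(' AND ')})"
  end
  groups.join(' OR ')
end

puts per_field_query('suco de limão', %w[body_texts_pt body_texts_en])
```

Each group is analyzed by its own field's query analyzer, so the stop word list of one language no longer affects the clause count of the other.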
One caveat about this solution is that it makes the assumption that the entire search will be in only one language. A mixed English and Portuguese search will probably fail because the complete set of words (excluding stop words) can't be found in only one of the text fields.