I am developing an application that supports indexing and searching of multi-language texts, including Hebrew, using the Solr engine.
After a lot of searching, I found that HebMorph is the best plugin to use for Hebrew.
My problem is that HebMorph's handling of Hebrew stopwords seems to differ from Solr's:
With Solr (any language): when I search for a stopword, the returned results don't include any of the stopwords that appear in the query.
Whereas when I search for Hebrew terms (after plugging HebMorph into Solr following this link), the returned results include all the stopwords that appear in the query.
1) Is this the normal behavior for HebMorph? If yes, how can I alter it? If no, what should I change?
2) HebMorph doesn't support synonyms yet (I read in its documentation that this is future work). Is there a way to use synonyms for Hebrew the way Solr supports them for other languages (i.e. by adding the proper filter in solrconfig and pointing it to the synonyms file)?
Thanks in advance for your help.
I'm the author of HebMorph.
Stopwords are indeed supported, but you need to filter them out before the lemmatizer kicks in. Assuming a recent version of HebMorph, your stopword filter needs to come right after the tokenizer, which means it also needs to handle the בחל"מ prefix letters attached to the stopwords.
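To illustrate what "taking care of attached letters" means, here is a minimal Python sketch of a prefix-aware stopword check. It is not HebMorph's or Lucene's API; the function names, the sample stopword list, the set of prefix letters, and the limit of two prefix letters are all assumptions made for the example. In a real setup the same logic would live in a token filter placed right after the tokenizer.

```python
# Hypothetical illustration only; not HebMorph's or Lucene's API.

# Assumption: single letters that can be attached to the front of a Hebrew word.
PREFIX_LETTERS = frozenset("ובכלמשה")

# Assumption: a tiny sample stopword list, just for the example.
STOPWORDS = frozenset({"של", "על", "את", "הוא"})


def is_stopword(token, max_prefix_len=2):
    """Return True if token is a stopword, possibly preceded by prefix letters."""
    for cut in range(max_prefix_len + 1):
        prefix, rest = token[:cut], token[cut:]
        # Stop as soon as the leading characters are not all prefix letters.
        if not all(ch in PREFIX_LETTERS for ch in prefix):
            break
        if rest in STOPWORDS:
            return True
    return False


def filter_stopwords(tokens):
    """Drop stopwords (bare or prefixed) from a list of tokens."""
    return [t for t in tokens if not is_stopword(t)]
```

For example, `filter_stopwords(["הספר", "של", "ושל", "דוד"])` keeps only `הספר` and `דוד`. Note that naive prefix stripping can over-match: with this sample list, `בעל` ("owner") would be read as ב + על and dropped.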
The general advice nowadays, for all languages, is NOT to drop stopwords - at least not in indexing, so I'd recommend not applying a stop-words filter here either.
With regard to synonyms: the root issue is that the HebMorph lemmatizer sometimes expands a word to multiple lemmas, which makes applying synonyms a bit more challenging. With the (relatively) new graph-based analyzers this is now possible, so we will likely implement it, and Lucene's synonym filters will then be supported out of the box.
In the commercial version there is already a way to customize word lists and override dictionary definitions, which is useful in an ambiguous language like Hebrew. Many use this as their way of creating synonyms.