
Multilingual SOLR Spellcheck Setup


We are trying to set up a multilingual spellchecking option in Solr, and have just finished setting up the basic Solr environment.

We use a field named 'spell' as the source for the spellcheck.

<lst name="spellchecker">
 <str name="name">default</str>
 <str name="field">spell</str>
 <!-- Remaining settings are not specified; Solr defaults to IndexBasedSpellChecker -->
</lst>

There is an existing language field, LANGUAGE_STRING, that is already indexed and stored. (Language detection is not required at the moment.)

Is there a way I can use this field to build the additional spell_* fields below while importing or updating content?

<requestHandler name="/select" class="solr.SearchHandler" lazy="true">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <int name="rows">10</int>
    <str name="spellcheck.dictionary">default</str>
    <str name="spellcheck.dictionary">spell_en</str>
    <str name="spellcheck.dictionary">spell_de</str>
    <str name="spellcheck.count">1</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>

I plan to use the single-core approach, separating languages by a document language field, as suggested in http://pavelbogomolenko.github.io/multi-language-handling-in-solr.html


Solution

  • I am answering my own question so that it helps others who are looking for a similar option. Apart from the Solr Suggester alternative, a solution that works for building a multilingual spell dictionary is to use the Script Update Processor (StatelessScriptUpdateProcessorFactory) and attach it to the /update handler using update.chain.

    <updateRequestProcessorChain name="script">
      <processor class="solr.StatelessScriptUpdateProcessorFactory">
        <str name="script">update-script.js</str>
        <lst name="params">
          <str name="config_param">Spell_Field</str>
        </lst>
      </processor> ...
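
    One way to attach the chain is to set update.chain as a default on the /update handler. This is a minimal sketch, assuming the handler is declared explicitly in solrconfig.xml and that the chain keeps the name "script" used above:

    ```xml
    <!-- Assumption: /update is declared explicitly in solrconfig.xml;
         adjust if your Solr version defines it implicitly. -->
    <requestHandler name="/update" class="solr.UpdateRequestHandler">
      <lst name="defaults">
        <!-- Must match the name attribute of the updateRequestProcessorChain -->
        <str name="update.chain">script</str>
      </lst>
    </requestHandler>
    ```

    Alternatively, the chain can be selected per request by passing update.chain=script as a request parameter.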
    

    The JavaScript file update-script.js is as follows:

    function processAdd(cmd) {
      var doc = cmd.solrDoc;  // org.apache.solr.common.SolrInputDocument
      var locale = doc.getFieldValue("locale");
      logger.info("update-script#processAdd: locale=" + locale);
    
      if (locale) {
        // The first two characters of the locale are taken as the language code.
        var lang_str = locale.substring(0, 2);
        logger.info("update-script#processAdd: language=" + lang_str);
    
        if (lang_str) {
          var spellField = "";
          var slash = " / ";  // Separator between values; the Standard Tokenizer Factory splits on it.
          var field_names = doc.getFieldNames().toArray();
          // Concatenate the value of every field on the document into one string.
          for (var i = 0; i < field_names.length; i++) {
            var field_name = field_names[i];
            if (field_name) { spellField += doc.getFieldValue(field_name) + slash; }
          }
          // The existing per-language dynamic field definitions in schema.xml
          // (*_txt_en, *_txt_de, etc.) tokenize this field.
          doc.addField("spell_txt_" + lang_str, spellField);
          logger.info("update-script#processAdd: spell_txt_" + lang_str + ":" + spellField);
        }
      }
    }
    
    function processDelete(cmd) { /* no-op */ }
    function processMergeIndexes(cmd) { /* no-op */ }
    function processCommit(cmd) { /* no-op */ }
    function processRollback(cmd) { /* no-op */ }
    function finish() { /* no-op */ }
    

    You can now wire these spell_txt_* fields to the spellchecker dictionaries and get suggestions based on the language.
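
    As an illustration, the wiring might look like the sketch below. The dictionary names spell_en and spell_de are taken from the request handler in the question; the spellcheckIndexDir paths, the buildOnCommit setting, and the queryAnalyzerFieldType value are assumptions, not part of the original setup:

    ```xml
    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <!-- Assumption: a text_general field type exists in the schema -->
      <str name="queryAnalyzerFieldType">text_general</str>

      <!-- English dictionary, built from the field filled by update-script.js -->
      <lst name="spellchecker">
        <str name="name">spell_en</str>
        <str name="field">spell_txt_en</str>
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <str name="spellcheckIndexDir">./spellchecker_en</str>
        <str name="buildOnCommit">true</str>
      </lst>

      <!-- German dictionary, with its own index directory -->
      <lst name="spellchecker">
        <str name="name">spell_de</str>
        <str name="field">spell_txt_de</str>
        <str name="classname">solr.IndexBasedSpellChecker</str>
        <str name="spellcheckIndexDir">./spellchecker_de</str>
        <str name="buildOnCommit">true</str>
      </lst>
    </searchComponent>
    ```

    A request can then choose the dictionary that matches its language, for example with spellcheck=true&spellcheck.dictionary=spell_de.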

    I checked several sources, but the following should be sufficient for most cases: https://lucidworks.com/post/getting-started-spell-checking-with-apache-lucene-and-solr/