ruby-on-rails, elasticsearch, elasticsearch-2.0

Elasticsearch: configure stemming for fields that can be French or English


I have a French/English website with a "profile" model whose fields can be written in English or in French (and I don't know in advance which one). Consider a simple Mongoid model:

class Profile
  include Mongoid::Document # needed for the `field` macro

  field :job_name
  field :company_name
end

I want an intelligent search on the job name that supports stemming. So basically I want an English + French analyzer on that field.

I believe I have figured out the indexing part, where I analyze the field in both languages:

mapping do
  indexes :job_name, type: :string, fields: {
    french: { type: :string, analyzer: 'french' },
    english: { type: :string, analyzer: 'english' }
  }
end

I have problems configuring the stemming on the search side. My default search uses multi_match with per-field boosting, and I don't really understand how to specify the analyzers on top of that:

query: {
  filtered: {
    query: {
      multi_match: {
        query: query,
        fields: [
          "company_name^3",
          "job_name^2",
        ],
        type: "best_fields",
        tie_breaker: 0.3
      }
    }
  }
}

Ideally, when searching for "achat" (French for "purchase"), the engine should return results where the job name contains:

  • "gestionnaire d'achat" (see the "d'" prefix),
  • "achats en gros" (see the plural).

It should also work for similar English words.

EDIT: My ES index (is the "no" normal?)

{
  "mydb": {
    "aliases": {},
    "mappings": {
      "profile": {
        "properties": {
          "company_name": {
            "type": "string"
          },
          "job_name": {
            "type": "string",
            "index": "no",
            "fields": {
              "english": {
                "type": "string",
                "analyzer": "english"
              },
              "french": {
                "type": "string",
                "analyzer": "french"
              }
            }
          }
        }
      }
    },
    "settings": {
      "index": {
        "creation_date": "1469789941429",
        "number_of_shards": "5",
        "number_of_replicas": "1",
        "uuid": "HHN-rWTTStCXDgQtJMTEPg",
        "version": {
          "created": "2030499"
        }
      }
    },
    "warmers": {}
  }
}
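On the "no": as far as I understand the ES 2.x string mapping, "index": "no" means the top-level job_name field is not searchable, while its sub-fields are still indexed on their own. The sketch below uses a hypothetical helper, searchable_fields (not part of Elasticsearch or any gem), that walks a hand-copied version of the "properties" hash above and lists the field paths a query could match. It assumes a flat mapping without nested objects.

```ruby
# Hypothetical helper (an assumption for illustration, not an Elasticsearch API):
# walks an ES 2.x "properties" hash and returns the field paths that are indexed.
def searchable_fields(properties, prefix = nil)
  properties.flat_map do |name, config|
    path = [prefix, name].compact.join(".")
    paths = []
    # A field is treated as searchable unless it is mapped with "index": "no".
    paths << path unless config["index"] == "no"
    # Multi-fields ("fields") are indexed independently of their parent field.
    paths + searchable_fields(config.fetch("fields", {}), path)
  end
end

# Hand-copied from the mapping above.
properties = {
  "company_name" => { "type" => "string" },
  "job_name" => {
    "type" => "string",
    "index" => "no",
    "fields" => {
      "english" => { "type" => "string", "analyzer" => "english" },
      "french"  => { "type" => "string", "analyzer" => "french" }
    }
  }
}

puts searchable_fields(properties)
```

If that reading is right, a query against "job_name^2" alone would match nothing, and the search has to target job_name.english and job_name.french instead.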

Solution

  • You can use a wildcard in the field name; multi_match will then query every job_name sub-field and apply each sub-field's own analyzer to the query text:

          "fields": [
            "job_name.*^2"
          ],
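Putting the wildcard back into the query from the question gives something like the sketch below. The method name profile_search_body is hypothetical, and passing the hash to Profile.search assumes the model uses elasticsearch-model; the rest is the original query with only the job_name entry changed.

```ruby
# Hypothetical helper: rebuilds the search body from the question, with the
# wildcard applied to the job_name sub-fields.
def profile_search_body(text)
  {
    query: {
      filtered: {
        query: {
          multi_match: {
            query: text,
            fields: [
              "company_name^3",
              "job_name.*^2" # should expand to job_name.english and job_name.french
            ],
            type: "best_fields",
            tie_breaker: 0.3
          }
        }
      }
    }
  }
end

body = profile_search_body("achat")
# Assumption: with elasticsearch-model this would be run as Profile.search(body).
```

Listing "job_name.french^2" and "job_name.english^2" explicitly should behave the same way; the wildcard just saves editing the query when another language sub-field is added.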