MongoDB: Index strategy for elastic search (collation and strength)


I am using MongoDB and have a fairly small collection of documents. Each document has these fields:

name: 'Always en_GB locale, default name',
ticker: 'Derivative short name from the name field. Like if my full name is Aleksandr, then ticker will be Sasha',
name_locale: 'This field has a string in different locales, sometimes it\'s Cyrillic, sometimes it\'s Spanish',
locale: 'en_GB', // or 'en_US', it shows the locale of the name_locale field
region: 'Europe', // or 'North America', it doesn't matter in this case

All these fields are String type only.

I also have an API endpoint that receives a single parameter, which could be the value of any of these fields. So it could be param=Europe or param=en_GB or param=name_locale_value.

So, in that case, I need a text index, right? And I have it.

schema.index(
  { name: 'text', name_locale: 'text', ticker: 'text', region: 'text' },
  { name: 'SearchQuery' },
);

But as you may notice, I don't specify any collation or strength values. So the question is:

How can I make case-insensitive searches with strength: 1, like in this answer, if I also need to search by the name_locale field, which holds string values in different languages?

Example documents and query:

{
  name: 'Aleksandr',
  ticker: 'Sasha',
  name_locale: 'Саша',
  locale: 'ru_RU',
  region: 'Europe',
},
{
  name: 'Jonathan',
  ticker: 'John',
  name_locale: 'Jonathan',
  locale: 'en_US',
  region: 'North America',
}

await collection.find(
  { $text: { $search: QueryValue } }, // QueryValue = 'europe'
  { score: { $meta: 'textScore' } },
);
  • Do I need to build many single-field indexes and use the $or operator to search across all the fields?
  • Or do text indexes in MongoDB 4.4 support strength without a locale?
  • Or maybe there is another universal way, like using .find with a RegExp? I would be grateful for any advice.

Solution

  • What "case-insensitive" means is locale-specific. To search by a query in a case-insensitive manner, you therefore need to know what language the query is in.

    Given that, I would create another field that contains the lowercased text from your locale-specific fields. Lowercase it in your application, which knows what language each field value is in. Then use the "simple" locale to perform a binary comparison against the lowercased query (also lowercased in your application, with knowledge of what language the query is in). This works for any language, so you should be able to use just one text index.

    This solution does not give you stemming. If you want stemming, I imagine you'd need to define one index for each language that might be used for querying.
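
The lowercasing step described above could be sketched roughly as follows. This is only an illustration under some assumptions: the locale strings (such as 'ru_RU') are assumed to map to BCP 47 tags by replacing the underscore, and the field name name_locale_lc and the helpers toBcp47, lowerFor, withSearchField and buildTextFilter are hypothetical names, not part of the question's schema or of any library.

```javascript
// Convert the document's locale format ('ru_RU') into a BCP 47 tag ('ru-RU'),
// which is the format String.prototype.toLocaleLowerCase accepts.
// Assumption: replacing the underscore is enough for the locales in use.
function toBcp47(locale) {
  return locale.replace(/_/g, '-');
}

// Lowercase a value using the case rules of the given locale
// (this matters for languages such as Turkish, where 'I' has a special lowercase form).
function lowerFor(value, locale) {
  return value.toLocaleLowerCase(toBcp47(locale));
}

// Hypothetical helper: returns a copy of the document with an extra field
// (name_locale_lc, an assumed name) holding the lowercased name_locale value.
function withSearchField(doc) {
  return {
    ...doc,
    name_locale_lc: lowerFor(doc.name_locale, doc.locale),
  };
}

// Hypothetical helper: lowercases the incoming query with the query's own
// locale and wraps it in a $text filter like the one used in the question.
function buildTextFilter(query, queryLocale) {
  return { $text: { $search: lowerFor(query, queryLocale) } };
}
```

The application would store the result of withSearchField(doc) instead of the raw document, include name_locale_lc in the text index, and pass buildTextFilter(param, queryLocale) to find. How the query's locale is determined (a request header, a second parameter, and so on) is left open here.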