mongodb lucene mongodb-atlas mongodb-atlas-search

Is there a way to ignore accents in MongoDB Atlas full-text search?


With the new Atlas Search feature, is there a way to ignore accents when searching?

I created this index:

{
 "analyzer": "lucene.standard",
 "searchAnalyzer": "lucene.standard",
 "mappings": {
   "dynamic": false,
   "fields": {
     "_id": {
       "type": "string",
       "analyzer": "lucene.keyword"
     },
     "firstName": {
       "type": "string",
       "analyzer": "lucene.french"
     },
     "lastName": {
       "type": "string",
       "analyzer": "lucene.french"
     },
     "email": {
       "type": "string",
       "analyzer": "lucene.standard"
     }
   }
 }
}

With this data:

db.testJTAFulltextSearch.insert({_id: "testFTS3", firstName: "René", lastName: "Martin", email: "rmartin@gmail.com"})
db.testJTAFulltextSearch.insert({_id: "testFTS4", firstName: "Rene", lastName: "Martin", email: "rmartin@gmail.com"})

And with this search:

db.testJTAFulltextSearch.aggregate([{$searchBeta: {index: "customer", text: {query: "René", path: ["_id", "firstName", "email"]}}}])

I got:

{ "_id" : "testFTS3", "firstName" : "René", "lastName" : "Martin", "email" : "rmartin@gmail.com" }

The accents are not ignored (é should be treated like an e). I was expecting:

{ "_id" : "testFTS3", "firstName" : "René", "lastName" : "Martin", "email" : "rmartin@gmail.com" }
{ "_id" : "testFTS4", "firstName" : "Rene", "lastName" : "Martin", "email" : "rmartin@gmail.com" }

Is there a way to ignore accents (diacritics) with MongoDB Atlas Search?

I guess that I need an ASCII folding analyzer, but I did not find one in the list of analyzers: https://docs.atlas.mongodb.com/reference/atlas-search/analyzers/#analyzers-ref
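
To make the expected behavior concrete, here is a minimal sketch in plain JavaScript of what ASCII folding would do to the names above. It is only an illustration: foldAccents is a hypothetical helper, not an Atlas Search or Lucene API, and a real folding filter covers many more characters than the combining marks handled here.

```javascript
// Hypothetical helper, not part of MongoDB or Atlas Search.
// Decompose each accented character into base letter + combining mark (NFD),
// then drop the combining marks (U+0300 to U+036F).
function foldAccents(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

const names = ["René", "Rene"];

// Fold and lowercase, roughly what an accent-insensitive analyzer would index.
const folded = names.map((name) => foldAccents(name).toLowerCase());

// Both names end up as the same term, so one query would match both documents.
console.log(folded);
```

If the index applied this kind of folding at both index time and query time, searching for "René" or "Rene" would produce the same term and return both documents.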

Using a collation does not seem to work:

db.testJTAFulltextSearch.aggregate([{$searchBeta: {index: "customer", text: {query: "René", path: ["_id", "firstName", "email"]}}}], {collation: {locale: "en", strength: 1}})

This still returns only the "René" document.


Solution

  • Have you tried the fuzzy option? It doesn't seem to be enabled by default, but fuzzy: { maxEdits: 2 } should have you covered, since "René" and "Rene" differ by a single character.

    I had a similar issue recently, but it turned out to be my own fault: I had set prefixLength: 1 instead of leaving it at the default value of 0 (see the thread). In my case I'm using the term operator instead of text, but I'm not sure how relevant that is.
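
    As a rough sketch of that suggestion applied to the query from the question: the snippet below builds the pipeline with the fuzzy option added to the text operator, and checks that "René" and "Rene" are one edit apart. buildFuzzySearch and editDistance are hypothetical helpers written for this example. The distance function is plain Levenshtein, which is assumed here to be close enough to what the search engine computes for this pair. I have not run this against an Atlas cluster.

```javascript
// Hypothetical helper: builds the question's pipeline with fuzzy matching added.
function buildFuzzySearch(query, paths, maxEdits) {
  return [
    {
      $searchBeta: {
        index: "customer",
        text: {
          query: query,
          path: paths,
          fuzzy: { maxEdits: maxEdits },
        },
      },
    },
  ];
}

// Hypothetical helper: plain Levenshtein distance over code points.
function editDistance(a, b) {
  const s = Array.from(a.normalize("NFC"));
  const t = Array.from(b.normalize("NFC"));
  let prev = Array.from({ length: t.length + 1 }, (_, j) => j);
  for (let i = 1; i <= s.length; i++) {
    const curr = [i];
    for (let j = 1; j <= t.length; j++) {
      const cost = s[i - 1] === t[j - 1] ? 0 : 1;
      curr[j] = Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost);
    }
    prev = curr;
  }
  return prev[t.length];
}

const pipeline = buildFuzzySearch("René", ["_id", "firstName", "email"], 2);
const distance = editDistance("René", "Rene");

// In the mongo shell: db.testJTAFulltextSearch.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
console.log("edits between René and Rene:", distance);
```

    Keep in mind that fuzzy matching is not the same as accent folding: with maxEdits: 2 the query can also match unrelated terms that happen to be within two edits of "René", so the result set may be broader than expected.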