
Azure Search: Searching for singular version of a word, but still include plural version in results


I have a question about a peculiar behavior I noticed in my custom analyzer (as well as in the fr.microsoft analyzer). The Analyze API tests below use the “fr.microsoft” analyzer, but I saw exactly the same behavior with my “text_contains_search_custom_analyzer” custom analyzer (which makes sense, as I based it on the fr.microsoft analyzer).

UAT reported that when they search for “femme” (singular) they expect documents with “femmes” (plural) to also be found. But when I tested with the Analyze API, it appears that the Azure Search service expands a plural into both plural and singular tokens, while a singular yields only the singular token. See below for examples.

Is there a way I can allow a user to search for the singular version of a word, but still include the plural version of that word in the search results? Or will I need to use synonyms to overcome this issue?

Request with “femme”:

{ "analyzer": "fr.microsoft", "text": "femme" }

Response from “femme”:

{
  "@odata.context": "https://EXAMPLESEARCHINSTANCE.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    { "token": "femme", "startOffset": 0, "endOffset": 5, "position": 0 }
  ]
}

Request with “femmes”:

{ "analyzer": "fr.microsoft", "text": "femmes" }

Response from “femmes”:

{
  "@odata.context": "https://EXAMPLESEARCHINSTANCE.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01.AnalyzeResult",
  "tokens": [
    { "token": "femme", "startOffset": 0, "endOffset": 6, "position": 0 },
    { "token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0 }
  ]
}
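For anyone who wants to reproduce the Analyze API calls above, here is a minimal Python sketch that builds the request and pulls the token strings out of a response. The index name, API key, and the helper names build_analyze_request and token_texts are placeholders of mine, not part of the original tests; the api-version is taken from the 2016_09_01 shown in the responses, and the request is only constructed here, not sent.

```python
import json
import urllib.request


def build_analyze_request(service, index, api_key, text,
                          analyzer="fr.microsoft", api_version="2016-09-01"):
    """Build (but do not send) a POST request for the index Analyze API."""
    url = (f"https://{service}.search.windows.net"
           f"/indexes/{index}/analyze?api-version={api_version}")
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def token_texts(analyze_response):
    """Return just the token strings from a parsed Analyze API response."""
    return [t["token"] for t in analyze_response["tokens"]]


# "my-index" and "YOUR-API-KEY" are hypothetical placeholders.
req = build_analyze_request("EXAMPLESEARCHINSTANCE", "my-index",
                            "YOUR-API-KEY", "femmes")
print(req.full_url)

# Token list copied from the "femmes" response above.
sample = {"tokens": [
    {"token": "femme", "startOffset": 0, "endOffset": 6, "position": 0},
    {"token": "femmes", "startOffset": 0, "endOffset": 6, "position": 0},
]}
print(token_texts(sample))  # ['femme', 'femmes']
```

Passing the request to urllib.request.urlopen and parsing the body with json.load would give the dictionary that token_texts expects.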


Solution

  • Just to add to yoape's response, the fr.microsoft analyzer reduces inflected words to their base form. In your case, the word femmes is reduced to its singular form femme. All cases that you described will work:

    1. Searching with the base form of a word if an inflected form was in the document.

      Let's say you're indexing a document with Vive with Femmes.
      The search engine will index the following terms: vif, vivre, vive, femme, femmes.
      If you search with any of these terms e.g., femme, the document will match.

    2. Searching with an inflected form of a word if the base form was in the document.

      Let's say you're indexing a document with the text Femme fatale.
      The search engine will index the following terms: femme, fatal, fatale.
      If you search with the term femmes, the analyzer will also produce its base form, so your query becomes femmes OR femme. Documents containing either of these terms will match.

    3. Searching with an inflected form if another inflected form of that word was in the document.

      If you have a document with allez, terms allez and aller will be indexed.
      If you search for alle, the query becomes alle OR aller. Since both inflected forms are reduced to the same base form, the document will match.

    The key takeaway here is that the analyzer processes query terms as well as documents. In both cases, terms are normalized according to language-specific rules.
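
    The three cases above can be sketched with a toy model in Python. This is only an illustration, not how the service is implemented: the BASE_FORMS table is hard-coded from the examples in this answer (the real fr.microsoft analyzer derives base forms itself), and analyze and matches are hypothetical helper names.

```python
# Toy lemma table copied from the examples in this answer. The real
# analyzer works these out itself; this table is an assumption.
BASE_FORMS = {
    "femmes": ["femme"],
    "vive": ["vif", "vivre"],
    "fatale": ["fatal"],
    "allez": ["aller"],
    "alle": ["aller"],
}


def analyze(text):
    """Emit each lowercased word plus any base forms listed for it."""
    terms = set()
    for word in text.lower().split():
        terms.add(word)
        terms.update(BASE_FORMS.get(word, []))
    return terms


def matches(query, document):
    """Match when the analyzed query shares a term with the analyzed document."""
    return bool(analyze(query) & analyze(document))


print(matches("femme", "Vive with Femmes"))  # case 1: True
print(matches("femmes", "Femme fatale"))     # case 2: True
print(matches("alle", "allez"))              # case 3: True
print(matches("homme", "Femme fatale"))      # unrelated word: False
```

    The same analyze function runs on both the document and the query, which is why a singular query finds a plural document even though the singular itself is not expanded.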

    I hope that explains it.