
Suggest on a field that is the concatenation of other existing fields


I have an Azure Cosmos DB database that contains a User collection (id, login, firstname, lastname, etc.).

I am using an Azure Cognitive Search index to search and suggest over users and other collections.

My use case appears very simple, but I'm not finding the best way to solve it.

I want suggestions to match not only the user's first and last name but also the full name, which is not a property in the database schema and can be either firstname + lastname or lastname + firstname.

For example, if I have a user named "John Doe", typing "John", "Doe John", or "Doe" should return the same user.

Index fields

{
    "name": "user_first_name",
    "type": "Edm.String",
    "searchable": true,
    "filterable": false,
    "retrievable": true,
    "sortable": true,
    "facetable": false,
    "key": false,
    "indexAnalyzer": null,
    "searchAnalyzer": null,
    "analyzer": "standard.lucene",
    "normalizer": null,
    "synonymMaps": []
},
{
    "name": "user_last_name",
    "type": "Edm.String",
    "searchable": true,
    "filterable": false,
    "retrievable": true,
    "sortable": true,
    "facetable": false,
    "key": false,
    "indexAnalyzer": null,
    "searchAnalyzer": null,
    "analyzer": "standard.lucene",
    "normalizer": null,
    "synonymMaps": []
},
{
    "name": "user_full_name",
    "type": "Edm.String",
    "searchable": true,
    "filterable": false,
    "retrievable": true,
    "sortable": false,
    "facetable": false,
    "key": false,
    "indexAnalyzer": null,
    "searchAnalyzer": null,
    "analyzer": "standard.lucene",
    "normalizer": null,
    "synonymMaps": []
},

I've tried using a MergeSkill in a skillset but can't get a successful result:

{
    "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
    "name": "skillset_users_merge_fullname",
    "description": "",
    "context": "/document",
    "insertPreTag": "",
    "insertPostTag": " ",
    "inputs": [
        {
            "name": "text",
            "source": "/document/user_first_name"
        },
        {
            "name": "itemsToInsert",
            "source": "/document/user_last_name"
        }
    ],
    "outputs": [
        {
            "name": "mergedText",
            "targetName": "user_full_name"
        }
    ]
}

Another alternative I found is synonymMaps, but it can't be applied in my use case: I have more than 20k users, and a synonym map doesn't allow more than 20k rules.

What is the best way to implement this kind of suggestion?
Thanks.


Solution

    Solution 1: Deal with the problem at the source

    Let's be honest: complex logic in a skillset is a pain to develop and test. If possible, deal with the problem at the source.

    In most databases you can create a view. For example, in SQL you can do it like this:

    CREATE VIEW view_for_indexer AS 
    SELECT first_name, last_name, CONCAT(first_name, ' ', last_name) AS full_name
    FROM people
    

    Then modify your data source to select from the view instead of the table.

    Unfortunately, modifying the database schema may not always be an option.
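
    For Cosmos DB specifically, a similar effect may be possible without touching the schema, because the Cosmos DB data source definition accepts a query in its container section. The following is a minimal sketch, assuming the SQL (NoSQL) API and assuming the documents store the names as firstname and lastname, as listed in the question. The data source name, the connection string placeholder, and the aliases are hypothetical; check them against your own setup before using anything like this.

```python
import json

# Hypothetical helper: builds the JSON body for a Cosmos DB data source whose
# query adds a computed user_full_name column. Field names (firstname,
# lastname) are assumptions based on the collection described in the question.
def build_cosmos_datasource(name: str, connection_string: str, container: str) -> dict:
    query = (
        "SELECT c.id, "
        "c.firstname AS user_first_name, "
        "c.lastname AS user_last_name, "
        "CONCAT(c.firstname, ' ', c.lastname) AS user_full_name, "
        "c._ts "
        "FROM c WHERE c._ts >= @HighWaterMark ORDER BY c._ts"
    )
    return {
        "name": name,
        "type": "cosmosdb",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container, "query": query},
        # Incremental indexing based on the _ts timestamp of each document.
        "dataChangeDetectionPolicy": {
            "@odata.type": "#Microsoft.Azure.Search.HighWaterMarkChangeDetectionPolicy",
            "highWaterMarkColumnName": "_ts",
        },
    }

datasource = build_cosmos_datasource("users-datasource", "<connection string>", "User")
print(json.dumps(datasource, indent=2))
```

    The printed JSON is what would be sent when creating or updating the data source. A second CONCAT with the arguments swapped could provide the "lastname firstname" variant in another field.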

    Solution 2: Concatenate the text in the indexer

    The problem you had is that the MergeSkill input itemsToInsert is not a string but an array of strings. So you do NOT call it like MergeSkill("John", "Smith") but like MergeSkill("John", ["Smith"]).

    To transform "Smith" into ["Smith"], you can use SplitSkill. The snippet below uses first_name and last_name; substitute your own field names (user_first_name and user_last_name in your index).

    "skills": [
        {
            "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
            "name": "create_last_name_array",
            "description": null,
            "context": "/document",
            "defaultLanguageCode": "en",
            "textSplitMode": "pages",
            "maximumPageLength": 50000,
            "pageOverlapLength": 0,
            "maximumPagesToTake": 1,
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/last_name"
                }
            ],
            "outputs": [
                {
                    "name": "textItems",
                    "targetName": "last_name_array"
                }
            ]
        },
        {
            "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
            "name": "form_full_name",
            "description": null,
            "context": "/document",
            "insertPreTag": " ",
            "insertPostTag": " ",
            "inputs": [
                {
                    "name": "text",
                    "source": "/document/first_name"
                },
                {
                    "name": "itemsToInsert",
                    "source": "/document/last_name_array"
                }
            ],
            "outputs": [
                {
                    "name": "mergedText",
                    "targetName": "full_name"
                }
            ]
        }
    ],
    

    Normally, SplitSkill is intended to split long text into smaller pages. But since last_name is very short and "maximumPageLength" is set to 50000, it will always return an array with a single element (or an empty array if last_name is empty).
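
    To make the data flow through the two skills easier to follow, here is a rough pure-Python model of it. This is not the service's implementation: split_pages and merge_text are hypothetical stand-ins that only mimic the behavior described above (length-based paging, and appending each item wrapped in the pre and post tags when no offsets are given).

```python
from typing import List

def split_pages(text: str, maximum_page_length: int = 50000) -> List[str]:
    """Stand-in for SplitSkill in "pages" mode: chunk the text by length only."""
    return [
        text[i:i + maximum_page_length]
        for i in range(0, len(text), maximum_page_length)
    ]

def merge_text(text: str, items_to_insert: List[str],
               pre_tag: str = " ", post_tag: str = " ") -> str:
    """Stand-in for MergeSkill without offsets: append each item, wrapped in the tags."""
    return text + "".join(pre_tag + item + post_tag for item in items_to_insert)

last_name_array = split_pages("Smith")          # a short string yields one page
full_name = merge_text("John", last_name_array)
no_last_name = merge_text("John", split_pages(""))  # empty input yields no pages

print(last_name_array)
print(repr(full_name))
print(repr(no_last_name))
```

    In this model the merged value ends with a space because the post tag is a single space; setting the post tag to an empty string would avoid that. Swapping the two inputs would give the "lastname firstname" variant.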