
Is there a way to make multi-term synonyms work in Azure Cognitive Search?


I have problems getting synonyms with more than one term to work.

To illustrate my problem, I have created a minimal index with four items describing hotels, loosely based on the hotels-example from the Azure Cognitive Search documentation.

{
    "value": [
        {
            "Id": "1",
            "Title": "Fancy stay, luxury, hotel, wifi, break fast"
        },
        {
            "Id": "2",
            "Title": "Roach Motel, budget, motel, internet, morning meal"
        },
        {
            "Id": "3",
            "Title": "Mediocre Inn, cheap, bed & breakfast, wi-fi, breakfast"
        },
        {
            "Id": "4",
            "Title": "Ok Stay, cost efficient, bed and breakfast, wi fi, breakfast"
        }
    ]
}

Each hotel item describes the same types of amenities, but in an unnormalized way. For example, they all say they have internet access, but they use different terms for it:

  • wifi
  • internet
  • wi-fi
  • wi fi

Users searching for hotels will be equally unnormalized. We want a search for any one of the above terms to return all of the above as matches.

We can submit a synonym map to do this:

{
    "format": "solr",
    "synonyms": "wifi,wi-fi,internet,wi fi"
}

Synonyms separated by commas are two-way synonyms: any of the terms is treated as equivalent to any of the others. The exception is wi fi, which does not work as expected because it consists of more than one token.

QUERIES

  • wifi: returns all 4, as expected
  • internet: returns all 4, as expected
  • wi-fi: returns all 4, as expected
  • wi fi: returns only 2 hits (the ones with wi-fi and wi fi)

I understand that the problem is that a query consisting of wi fi is split into two separate tokens, so the synonym lookup never sees wi fi as a single term and does not expand it.

WORKAROUND

A known workaround is to change the query to a phrase-query, so it becomes "wi fi".

  • "wi fi": returns all 4 hits, as expected

However, the end-user query may consist of multiple terms, like

hotel affordable wi fi breakfast

So I cannot simply wrap the entire query in quotes, as it would then not match anything. Can anyone suggest a workaround to get the built-in synonym functionality to work for this use case? It's not hard to see that many similar examples require synonyms with multiple terms to work.

  • affordable, cost efficient, cheap
  • break fast, breakfast, morning meal
  • ...
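
For reference, groups like these would go into a single synonym map as one comma-separated rule per line. A minimal Python sketch of building that body follows. The map name hotel-synonyms is a made-up placeholder, and including a name property in the body is an assumption about how the map is created. It only shows the rule format and does not by itself fix the multi-term problem.

```python
import json

# Groups taken from the question; the map name used below is a made-up placeholder.
SYNONYM_GROUPS = [
    ["wifi", "wi-fi", "internet", "wi fi"],
    ["affordable", "cost efficient", "cheap"],
    ["break fast", "breakfast", "morning meal"],
]


def build_synonym_map(name, groups):
    """Build a synonym map body: one comma-separated rule per group, one rule per line."""
    rules = [",".join(group) for group in groups]
    return {"name": name, "format": "solr", "synonyms": "\n".join(rules)}


payload = build_synonym_map("hotel-synonyms", SYNONYM_GROUPS)
print(json.dumps(payload, indent=4))
```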

PS: We are using the SDK to index content. We have extensive pre-processing of content, using regular C# to manipulate the content and data model as we wish. The same goes for the front end, where we manipulate the query using code we control.

Any creative suggestions are welcome.


Solution

  • Great question and write-up. This is a known challenge because the terms wi and fi are split into separate tokens before they reach the synonym map, as you mentioned.

    One workaround I've used successfully is to use a synonym token filter to expand single terms into multiple terms at indexing time.

    For example, with the custom analyzer below, if a document has wifi in it, then wi fi will also be added to the inverted index so you'd get a match for all documents when searching wi fi.

        "analyzers": [
            {
                "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
                "name": "synonym-analyzer",
                "tokenizer": "microsoft_language_tokenizer",
                "tokenFilters": [
                    "synonym-filter"
                ],
                "charFilters": []
            }
        ],
        "normalizers": [],
        "tokenizers": [],
        "tokenFilters": [
            {
                "@odata.type": "#Microsoft.Azure.Search.SynonymTokenFilter",
                "name": "synonym-filter",
                "synonyms": [
                    "wifi,wi-fi,internet,wi fi"
                ],
                "expand": true
            }
        ]
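
To see why index-time expansion helps, here is a toy Python model of the idea, using the four documents from the question. It is not how the actual filter is implemented: the tokenizer is a simple regex, token positions are ignored, and a document counts as a match when it contains every query token. All function names are made up for the illustration.

```python
import re

DOCS = {
    "1": "Fancy stay, luxury, hotel, wifi, break fast",
    "2": "Roach Motel, budget, motel, internet, morning meal",
    "3": "Mediocre Inn, cheap, bed & breakfast, wi-fi, breakfast",
    "4": "Ok Stay, cost efficient, bed and breakfast, wi fi, breakfast",
}
SYNONYM_GROUP = ["wifi", "wi-fi", "internet", "wi fi"]


def tokenize(text):
    """Toy tokenizer: lowercase, split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_sequence(tokens, seq):
    """True if seq occurs as a contiguous run inside tokens."""
    n = len(seq)
    return any(tokens[i:i + n] == seq for i in range(len(tokens) - n + 1))


def index_tokens(text, expand):
    """Return the set of tokens stored in the toy index for one document."""
    tokens = tokenize(text)
    stored = set(tokens)
    if expand:
        variants = [tokenize(term) for term in SYNONYM_GROUP]
        # If any variant occurs in the document, store the tokens of every variant.
        if any(contains_sequence(tokens, v) for v in variants):
            for v in variants:
                stored.update(v)
    return stored


def search(query, expand):
    """Ids of documents whose stored tokens include every query token."""
    wanted = set(tokenize(query))
    return sorted(doc_id for doc_id, text in DOCS.items()
                  if wanted <= index_tokens(text, expand))


print(search("wi fi", expand=False))  # ['3', '4']
print(search("wi fi", expand=True))   # ['1', '2', '3', '4']
```

Without expansion, the toy index reproduces the two hits from the question. With expansion, the tokens wi and fi are stored for every document that mentions any variant, so the unquoted query matches all four.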
    
    

    You could choose to use this analyzer just at indexing time or at both indexing and query time, depending on your preference. One downside of the synonym token filter is that, unlike a synonym map, it is immutable, so you can't change it without recreating your index.
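
Since you control the query in code and the phrase query "wi fi" already works, another option that avoids re-indexing might be to quote known multi-word synonyms before sending the query. Below is a minimal sketch in Python; the same regex approach should port to C#. The term list and the function name are hypothetical, and in practice you would probably generate the list from the multi-word entries of your synonym map.

```python
import re

# Hypothetical list: every multi-word entry from the synonym map.
MULTI_WORD_TERMS = ["wi fi", "cost efficient", "break fast", "morning meal"]


def quote_multi_word_terms(query, terms=MULTI_WORD_TERMS):
    """Wrap each known multi-word term in double quotes so it is sent as a phrase."""
    # Longest first, so a longer term wins over a shorter one at the same position.
    ordered = sorted(terms, key=len, reverse=True)
    # \s+ lets any run of whitespace separate the words of a term.
    alternatives = [r"\s+".join(re.escape(word) for word in term.split())
                    for term in ordered]
    # The lookarounds skip matches inside longer words and matches that
    # already touch a double quote. A term in the middle of a longer quoted
    # phrase is not handled by this sketch.
    pattern = re.compile(
        r'(?<![\w"])(' + "|".join(alternatives) + r')(?![\w"])',
        flags=re.IGNORECASE,
    )
    return pattern.sub(lambda m: '"' + m.group(1) + '"', query)


print(quote_multi_word_terms("hotel affordable wi fi breakfast"))
# hotel affordable "wi fi" breakfast
```

Note that breakfast is left alone because the pattern for break fast requires whitespace between the two words, and a term the user has already quoted is not quoted again.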