Tags: azure-cognitive-search, azure-search-.net-sdk

Azure Search - Search Highlight values break when the searchable value has a sentence separator


Hello Azure Search Team,

Apologies for the length of this question; I wanted to illustrate it with some sample data.

I'm from the Power BI team and have a question, based on the documentation, about the Search Highlight feature in Azure Search.

Yesterday I created an Azure Search index containing a sample document like the one below.

"DocumentId": "257d13f0-ea1f-412f-9858-baa49b35f6b5",
"ModelId": "78869cb7-352e-4415-911e-464308c6d8d9",
"TableId": "Employees",
"ColumnId": "Details",
"ColumnValues": [
    "Boston Massachusetts",
    "Tampa Florida",
    "Palo Alto California",
    "Sentenceeeeeeeeeeeeeeeeeeeeeee with 101 characters tokenwith50characterssssssssssssssssssssssssssssss",
    "Data is repeated Data is repeated Data is repeated",
    "Data is repeated. Data is repeated. Data is repeated.",
    "Washington",
    "Washington D.C"
]

Note that only "ColumnValues" is searchable. Also notice the repeated values in ColumnValues[4] and ColumnValues[5], without and with an English sentence separator (.) respectively (indexes start at 0).

Now, if a user searches for "Data", we pass the following search query to Azure Search:

\"/.*Data.*/\" &queryType=full &highlight=ColumnValues-100&highlightPreTag=''&highlightPostTag=" &searchMode=any &$top=1500 &$count=true
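For readers who want to reproduce the request, here is a minimal Python sketch that assembles the same parameters into a URL. The service and index names are taken from the @odata.context value in the response that follows; the api-version value and the helper name build_search_url are assumptions, and the highlight tag strings are copied as they appear in the query above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Service and index names come from the @odata.context URL in the response.
SERVICE = "huynazuresearch1"
INDEX = "columnbasedindex"
API_VERSION = "2020-06-30"  # assumption: substitute the version you target


def build_search_url(term):
    """Build the search URL for a regex 'contains' query on `term`.

    Hypothetical helper. `term` is placed into the regular expression
    unescaped, so regex metacharacters in it would need escaping first.
    """
    params = {
        "api-version": API_VERSION,
        "search": f"/.*{term}.*/",        # full Lucene regex syntax
        "queryType": "full",
        "highlight": "ColumnValues-100",  # field name and highlight count
        "highlightPreTag": "''",
        "highlightPostTag": '"',
        "searchMode": "any",
        "$top": 1500,
        "$count": "true",
    }
    # safe="$" keeps the OData-style parameter names ($top, $count) readable.
    query = urlencode(params, safe="$")
    return f"https://{SERVICE}.search.windows.net/indexes/{INDEX}/docs?{query}"


url = build_search_url("Data")
parsed = parse_qs(urlsplit(url).query)  # round-trip to check the encoding
```

The sketch only builds the URL; sending it would also require an api-key header for the service.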

Below is the response from the Azure Search API, as shown in the search portal:

{
    "@odata.context": "https://huynazuresearch1.search.windows.net/indexes('columnbasedindex')/$metadata#docs(*)",
    "@odata.count": 1,
    "value": [
        {
            "@search.score": 1,
            "@search.highlights": {
                "ColumnValues": [
                    "''Data\"  is repeated ''Data\"  is repeated ''Data\"  is repeated",
                    "''Data\"  is repeated.",
                    "''Data\"  is repeated.",
                    "''Data\"  is repeated."
                ]
            },
            "DocumentId": "257d13f0-ea1f-412f-9858-baa49b35f6b5",
            "ModelId": "78869cb7-352e-4415-911e-464308c6d8d9",
            "TableId": "Employees",
            "ColumnId": "Details",
            "ColumnValues": [
                "Boston Massachusetts",
                "Tampa Florida",
                "Palo Alto California",
                "Sentenceeeeeeeeeeeeeeeeeeeeeee with 101 characters tokenwith50characterssssssssssssssssssssssssssssss",
                "Data is repeated Data is repeated Data is repeated",
                "Data is repeated. Data is repeated. Data is repeated.",
                "Washington",
                "Washington D.C"
            ]
        }
    ]
}

The document is returned as expected, but we then do some processing on the search highlight values returned by Azure Search.

For our needs, we have to build a ColumnInfo object of {ColumnId, ColumnValues} for each match. To do that, we iterate over the @search.highlights array and try to map each highlighted value back to the corresponding entry in ColumnValues.

For the first value in @search.highlights.ColumnValues, "''Data\" is repeated ''Data\" is repeated ''Data\" is repeated", we can easily map it to ColumnValues[4] with an equality match once the highlight tags are removed.

So we can form the ColumnInfo object {"Details", "Data is repeated Data is repeated Data is repeated"} easily. However, the remaining values in @search.highlights.ColumnValues (indexes 1, 2, and 3) are all the same fragment, "''Data\" is repeated.", and all three map to ColumnValues[5].
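To make the mapping problem concrete, here is a minimal Python sketch of the lookup described above, run against the highlight fragments from the response. The helper names strip_tags and map_fragment are hypothetical, the tag strings are assumed to be the ones passed in the query, and whitespace is collapsed before comparing because the returned fragments contain extra spaces after the closing tag.

```python
import re

# Highlight tags as passed in the query above (treated as assumptions here).
PRE_TAG = "''"
POST_TAG = '"'

COLUMN_VALUES = [
    "Boston Massachusetts",
    "Tampa Florida",
    "Palo Alto California",
    "Sentenceeeeeeeeeeeeeeeeeeeeeee with 101 characters "
    "tokenwith50characterssssssssssssssssssssssssssssss",
    "Data is repeated Data is repeated Data is repeated",
    "Data is repeated. Data is repeated. Data is repeated.",
    "Washington",
    "Washington D.C",
]

HIGHLIGHTS = [
    "''Data\"  is repeated ''Data\"  is repeated ''Data\"  is repeated",
    "''Data\"  is repeated.",
    "''Data\"  is repeated.",
    "''Data\"  is repeated.",
]


def strip_tags(fragment, pre=PRE_TAG, post=POST_TAG):
    """Remove the highlight tags and collapse runs of whitespace."""
    plain = fragment.replace(pre, "").replace(post, "")
    return re.sub(r"\s+", " ", plain).strip()


def map_fragment(fragment, column_values):
    """Return (match_kind, indexes) for one highlight fragment.

    Tries an equality match first and falls back to a substring match.
    """
    plain = strip_tags(fragment)
    exact = [i for i, value in enumerate(column_values) if value == plain]
    if exact:
        return "equals", exact
    partial = [i for i, value in enumerate(column_values) if plain in value]
    return "contains", partial


results = [map_fragment(h, COLUMN_VALUES) for h in HIGHLIGHTS]
```

With this data, the first fragment resolves by equality to index 4, while the other three resolve only by containment to index 5. Note that stripping tags this way would also alter a column value that legitimately contains the tag characters.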

I see an issue with this. When the searchable value contains a "." (or some other delimiter), the search highlight breaks the text at that point and therefore does not return the entire ColumnValues entry.

Since we are building a ColumnInfo object of {ColumnId, ColumnValues}, we need the entire value of the matched ColumnValues entry, not parts or highlighted fragments of it.

Is there any way to override this behavior so that Azure Search returns the entire string of the matched ColumnValues entry as part of the search highlight? That would save us from doing a Contains-style match after getting results from Azure Search in order to construct the custom ColumnInfo object of {ColumnId, ColumnValues}.

What are the suggested options for this? Apologies again for the length; I'm happy to schedule a short call to discuss if needed.

Thanks, Sagar


Solution

  • I am from the Azure Cognitive Search engineering team. Thanks for the detailed post, which helped me understand your use case.

    Unfortunately, there is no mechanism to override how the text is fragmented during the highlighting process in Azure Search. The decision to split on sentence boundaries was made to align with the most common highlighting scenario, where users want specific parts of the text with highlights rather than the complete text.

    There also seems to be some confusion between the input ColumnValues collection field and the collection returned as highlights. These are not the same, and their items can't be correlated. The highlights are a collection of highlighted fragments drawn from the entire field text; from the highlighting perspective, all the items in the input collection together form that field text.

    This use case will have to be handled on the client side by parsing the original input collection and checking each item for the query term.
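A minimal Python sketch of that client-side approach follows. The ColumnInfo class and the matching_columns helper are hypothetical names, and the case-insensitive substring test is an assumption that only approximates the /.*Data.*/ query; it does not reproduce the index analyzer's behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ColumnInfo:
    """Hypothetical client-side pairing of a column id with one matched value."""
    column_id: str
    column_value: str


def matching_columns(document, term):
    """Return one ColumnInfo per ColumnValues entry that contains `term`.

    Works on the original field values of a returned document and ignores
    @search.highlights entirely.
    """
    needle = term.casefold()
    return [
        ColumnInfo(document["ColumnId"], value)
        for value in document["ColumnValues"]
        if needle in value.casefold()
    ]


# Trimmed copy of the document returned in the question.
document = {
    "ColumnId": "Details",
    "ColumnValues": [
        "Boston Massachusetts",
        "Tampa Florida",
        "Palo Alto California",
        "Data is repeated Data is repeated Data is repeated",
        "Data is repeated. Data is repeated. Data is repeated.",
        "Washington",
        "Washington D.C",
    ],
}

matches = matching_columns(document, "Data")
```

Here matches holds two ColumnInfo objects, one for each full "Data is repeated" value, with the sentence separators left intact because the original strings are never fragmented.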