azure, azure-cognitive-search

Customize Azure Search scoring in a specific way


Consider a scenario where every document has the following fields: email, first name, last name, and title.

The requirement is that the score for email should be either 100 (exact match) or 0. For the remaining fields, the score ranges from 0 to 100 based on edit distance.

Suppose the index contains records like the following:

[email protected],Peterr,Parker,Developer
[email protected],Steve,Smith,Manager

The query is a fuzzy search across all fields, with parameters such as [email protected],Pet,Par,Devl

The first record should then receive a score like

score for email + score for first name + score for last name + score for title

= 100 + 50 (approx., from the edit distance between 'Peterr' and 'Pet') + 50 (approx., from the edit distance between 'Parker' and 'Par') + 44 (approx., from the edit distance between 'Developer' and 'Devl')

=244
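For concreteness, here is a minimal Python sketch of the scoring I have in mind. It assumes each non-email field is scored as 100 * (1 - edit distance / length of the longer string), rounded down, which is one normalization that reproduces the numbers above. The function names and the placeholder email address are hypothetical and only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def field_score(query: str, value: str) -> int:
    """0..100, assumed to be 100 * (1 - distance / longer length), rounded down."""
    longest = max(len(query), len(value))
    if longest == 0:
        return 100
    distance = levenshtein(query.lower(), value.lower())
    return 100 * (longest - distance) // longest


def email_score(query: str, value: str) -> int:
    """All or nothing: 100 on an exact (case-insensitive) match, else 0."""
    return 100 if query.lower() == value.lower() else 0


def record_score(query: tuple, record: tuple) -> int:
    """Email score plus the per-field scores of the remaining fields."""
    q_email, *q_rest = query
    r_email, *r_rest = record
    return email_score(q_email, r_email) + sum(
        field_score(q, r) for q, r in zip(q_rest, r_rest)
    )


# Placeholder address: the real emails are not readable in this post.
record = ("peter@example.com", "Peterr", "Parker", "Developer")
query = ("peter@example.com", "Pet", "Par", "Devl")
print(record_score(query, record))  # 100 + 50 + 50 + 44 = 244
```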

The remaining records should be scored in the same way.

I checked, and Azure Search scoring supports weights, but I don't think those would help much in a scenario like this. What we are mainly looking for is a way to make the search score that Azure Search returns for each record follow the scheme described above.


Solution

  • To clarify, it seems what you need is the scoring formula to be a function of the edit distance between the query term and the indexed term - the shorter the distance, the higher the score. Unfortunately, this is not possible in Azure Search.

    Azure Search engine executes the search query in two phases: retrieval and scoring.

    During retrieval, the search query terms processed by the lexical analyzer are looked up in the inverted index, and the documents that contain those terms are returned. When you use fuzzy search, we expand your search query by adding terms from the inverted index that are within the allowed edit distance of a given query term (fuzzy expansion). This way your query can match more documents.
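As a rough illustration of fuzzy expansion, here is a toy Python sketch. It is not how the service is implemented: the in-memory index, the function names, and the use of plain Levenshtein distance with a limit of 2 edits are all assumptions made for the example.

```python
from collections import defaultdict


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,
                            curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


# Toy inverted index: term -> ids of the documents that contain it.
docs = {1: "peterr parker developer", 2: "steve smith manager"}
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():
        index[term].add(doc_id)


def fuzzy_expand(term: str, max_edits: int = 2) -> list:
    """Indexed terms within max_edits of the query term."""
    return sorted(t for t in index if edit_distance(term, t) <= max_edits)


def retrieve(term: str, max_edits: int = 2) -> set:
    """Documents that contain any of the expanded terms."""
    matched = set()
    for expanded in fuzzy_expand(term, max_edits):
        matched |= index[expanded]
    return matched


print(fuzzy_expand("parkr"))  # the misspelling expands to an indexed term
print(retrieve("parkr"))
print(fuzzy_expand("pet"))    # 'peterr' is 3 edits away, outside the limit
```

Under the assumed limit of 2 edits, a short query term such as `pet` is too far from `peterr` to be added by fuzzy expansion at all, so that document would not be retrieved through that term.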

    During scoring, we assign a relevance score to the retrieved documents using the Lucene scoring formula, which is based on TF/IDF. In practice, this means that documents that matched rare terms are ranked higher in the result set.
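To make the effect of term rarity concrete, here is a simplified TF/IDF sketch in Python. It is only in the spirit of the Lucene formula: the real formula has more factors (length normalization, for example), and the toy corpus, the names in it, and the exact `tf` and `idf` definitions below are assumptions for illustration.

```python
import math

# Toy corpus: 'manager' is common, 'developer' is rare.
docs = {
    "doc1": ["steve", "manager"],
    "doc2": ["peterr", "developer"],
    "doc3": ["anna", "manager"],
    "doc4": ["raj", "manager"],
}


def idf(term: str) -> float:
    """Inverse document frequency: the rarer the term, the larger the value."""
    doc_freq = sum(1 for terms in docs.values() if term in terms)
    return 1 + math.log(len(docs) / (doc_freq + 1))


def score(doc_id: str, query_terms: list) -> float:
    """Sum of sqrt(term frequency) * idf over the query terms that matched."""
    terms = docs[doc_id]
    return sum(math.sqrt(terms.count(t)) * idf(t)
               for t in query_terms if t in terms)


query = ["manager", "developer"]
ranking = sorted(docs, key=lambda d: score(d, query), reverse=True)
print(ranking[0])  # doc2: it matched the rarer term
```

Nothing in this kind of formula looks at how close the matched term is to the query term, which is why the score cannot be made to follow edit distance.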

    It's important to know that the Lucene scoring formula only applies to documents that matched the original query terms or terms added through fuzzy expansion. Documents that matched terms added through prefix expansion or regex/wildcard expansion are given a constant score of 1. This way those documents appear in the result set but don't affect the ranking, which is based on term frequency.

    Hope that helps