Elasticsearch seemingly random scoring and matching


I am using a bool query to match against multiple fields. The fields are analyzed at index time with several filters, chiefly edge_ngram.

The issue I'm having is that scoring seems unpredictable. I would expect a search for savvas to rank documents whose first_name is Savvas at the top, but they are scored much lower. For example, a search for savvas returns, in order of score:

First name | Last name       | Email
___________|_________________|________________________
------     | Sav---          | [email protected]
-----s     | Sa----          | [email protected]
Sa----     | ----            | [email protected]  
Sa----     | --------        | [email protected]
sa-        | -----           | [email protected]
Sa--       | ----s-----s     | [email protected]
Sa----     | -----------     | [email protected]
Savvas     | -------s        | [email protected]
Savvas     | -------s        | [email protected]
Sa-        | ---s----S------ | [email protected]

I have replaced characters other than edge n-grams of the search term in the fields with - and modified the lengths of email to protect identities.

In fact, searching for ssssssssssssssss, which doesn't exist in my data, returns the items with the most s characters in them. I wouldn't expect this to happen, as I am not applying any n-grams to my search string manually.

The issue also appears when I search for a phone number: searching for 782 ranks emails containing the characters 78 above phone numbers that contain 782 as an exact n-gram.

It seems that Elasticsearch is also applying the n-grams to my search query, not just to the field, then comparing the two and somehow favouring shorter matches over longer ones.
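That suspicion is consistent with how the mapping is declared: when a field only sets "analyzer", the same chain (edge_ngram included) is normally applied to the query string as well. The Python sketch below is only an approximation of the autocomplete filter, not Elasticsearch itself; the function name edge_ngrams and the indexed name "sam" are made up for illustration. It shows that each search string from the question expands to one- and two-character prefixes, which any indexed token starting with the same letters will share.

```python
def edge_ngrams(token, min_gram=1, max_gram=50):
    """Return the prefixes an edge_ngram filter would emit for one token."""
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]

# Search strings taken from the examples in the question.
for query in ["savvas", "ssssssssssssssss", "782"]:
    print(query, "->", edge_ngrams(query.lower())[:3], "...")

# A hypothetical indexed first name that merely starts with "sa".
indexed = set(edge_ngrams("sam"))
shared = sorted(indexed & set(edge_ngrams("savvas")))
print(shared)  # ['s', 'sa']
```

Because a match query combines its terms with OR by default, sharing even the single token "s" is enough for a document to be returned.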

Here is my query:

{
    'bool': {
        'should': [ // Any one of these matches will return a result
            {
                'match': {
                    'phone': {
                        'query': $searchString,
                        'fuzziness': '0',
                        'boost': 3 // If phone matches give it precedence
                    }
                }
            },
            {
                'match': {
                    'email': {
                        'query': $searchString,
                        'fuzziness': '0'
                    }
                }
            },
            {
                'multi_match': {
                    'query': $searchString,
                    'type': 'cross_fields', // Match if any term is in any of the fields
                    'fields': ['name.first_name', 'name.last_name'],
                    'fuzziness': '0'
                }
            }
        ],
        'minimum_should_match': 1
    }
}

And the index settings to go with it (apologies for verbosity but I don't want to exclude anything that may be important):

{
    "settings":{
        "analysis":{
            "char_filter":{
                "trim":{
                    "type":"pattern_replace",
                    "pattern":"^\\s*(.*)\\s*$",
                    "replacement":"$1"
                },
                "tel_strip_chars":{
                    "type":"pattern_replace",
                    "pattern":"^(\\(\\d+\\))|^(\\+)|\\D",
                    "replacement":"$1$2"
                },
                "tel_uk_exit_coded":{
                    "type":"pattern_replace",
                    "pattern":"^00(\\d+)",
                    "replacement":"+$1"
                },
                "tel_parenthesized_country_code":{
                    "type":"pattern_replace",
                    "pattern":"^\\((\\d+)\\)(\\d+)",
                    "replacement":"+$1$2"
                }
            },
            "tokenizer":{
                "intl_tel_country_code": {
                    "type":"pattern",
                    "pattern":"\\+(9[976]\\d|8[987530]\\d|6[987]\\d|5[90]\\d|42\\d|3[875]\\d|2[98654321]\\d|9[8543210]|8[6421]|6[6543210]|5[87654321]|4[987654310]|3[9643210]|2[70]|7|1)(\\d{1,14})$",
                    "group":0
                }
            },
            "filter":{
                "autocomplete":{
                    "type":"edge_ngram",
                    "min_gram":1,
                    "max_gram":50
                },
                "autocomplete_tel":{
                    "type":"ngram",
                    "min_gram":3,
                    "max_gram":20
                },
                "email":{
                    "type":"pattern_capture",
                    "preserve_original":1,
                    "patterns":[
                        "([^@]+)",
                        "(\\p{L}+)",
                        "(\\d+)",
                        "@(.+)",
                        "([^-@]+)"
                    ]
                }
            },
            "analyzer":{
                "name":{
                    "type":"custom",
                    "tokenizer":"standard",
                    "filter":[
                        "trim",
                        "lowercase",
                        "asciifolding",
                        "autocomplete"
                    ]
                },
                "email":{
                    "type":"custom",
                    "tokenizer":"uax_url_email",
                    "filter":[
                        "trim",
                        "lowercase",
                        "email",
                        "unique",
                        "autocomplete"
                    ]
                },
                "phone":{
                    "type":"custom",
                    "tokenizer":"intl_tel_country_code",
                    "char_filter":[
                        "trim",
                        "tel_strip_chars",
                        "tel_uk_exit_coded",
                        "tel_parenthesized_country_code"
                    ],
                    "filter":[
                        "autocomplete_tel"
                    ]
                }
            }
        }
    },
    "mappings":{
        "person":{
            "properties":{
                "address":{
                    "properties":{
                        "country":{
                            "type":"string",
                            "index_name":"country"
                        }
                    }
                },
                "timezone":{
                    "type":"string"
                },
                "name":{
                    "properties":{
                        "first_name":{
                            "type":"string",
                            "analyzer":"name"
                        },
                        "last_name":{
                            "type":"string",
                            "analyzer":"name"
                        }
                    }
                },
                "email":{
                    "type":"string",
                    "analyzer":"email"
                },
                "phone":{
                    "type":"string",
                    "analyzer":"phone"
                },
                "id":{
                    "type":"string"
                }
            }
        }
    }
}

I have tested the index settings using the Kopf plugin's analyser and it appears to create the correct tokens.
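A similar check can be approximated outside Elasticsearch for the phone analyzer. The sketch below is an assumption-laden stand-in: it skips the char filters, assumes the pattern tokenizer with group 0 emits only the text matched by the pattern, and assumes Python's re module treats this pattern the same way Java does. The phone number and the helper names (ngrams, phone_tokens) are hypothetical. Under those assumptions, a full international number is indexed with 782 among its n-grams, but the bare search string 782 yields no tokens at all, because the pattern requires a leading + and country code.

```python
import re

# Pattern copied from the intl_tel_country_code tokenizer (JSON escaping removed).
TEL_PATTERN = re.compile(
    r"\+(9[976]\d|8[987530]\d|6[987]\d|5[90]\d|42\d|3[875]\d|2[98654321]\d"
    r"|9[8543210]|8[6421]|6[6543210]|5[87654321]|4[987654310]|3[9643210]"
    r"|2[70]|7|1)(\d{1,14})$"
)

def ngrams(token, min_gram=3, max_gram=20):
    """All substrings of length min_gram..max_gram, like autocomplete_tel."""
    out = []
    for size in range(min_gram, min(len(token), max_gram) + 1):
        for start in range(len(token) - size + 1):
            out.append(token[start:start + size])
    return out

def phone_tokens(text):
    # Assumption: char filters already ran, so text is digits with an optional "+".
    tokens = []
    for match in TEL_PATTERN.finditer(text):
        tokens.extend(ngrams(match.group(0)))  # group 0 = the whole match
    return tokens

indexed = phone_tokens("+447821234567")  # hypothetical number
print("782" in indexed)     # True
print(phone_tokens("782"))  # []
```

If this mirrors what happens at search time, the phone clause would contribute nothing for a partial number, which would leave the email clause to decide the ranking.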

Ideally, I would match only exactly against the tokens created at index time, and prioritise a more precise match in one of the bool query's should clauses over several should clauses matching.

However, I would be happy if it at least matched exact tokens only. I can't use a term query, as my search string itself needs to be tokenized, just without any n-grams applied to it.

To sum up my requirements:

  • Score first by largest possible match in any single field.
  • Then score by lowest offset of possible match in any single field.
  • Then score by number of fields matched giving preference to lower offset matches

--- Update: ---

I am getting much better results using dis_max: it seems to rank longer n-gram matches above multiple shorter n-gram matches, except for the phone field, which is still hard to query. Here's the new query:

{
    'dis_max': {
        'tie_breaker': 0.0,
        'boost': 1.5,
        'queries': [ // Any one of these matches will return a result
            {
                'match': {
                    'phone': {
                        'query': $searchString,
                        'boost': 1.9
                    }
                }
            },
            {
                'match': {
                    'email': {
                        'query': $searchString
                    }
                }
            },
            {
                'multi_match': {
                    'query': $searchString,
                    'type': 'cross_fields', // Match if any term is in any of the fields
                    'fields': ['name.first_name', 'name.last_name'],
                    'tie_breaker': 0.1,
                    'boost': 1.5
                }
            }
        ]
    }
}
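
A rough way to see why dis_max changes the ordering: a bool query adds up the scores of its matching should clauses, whereas dis_max takes the best clause score and adds tie_breaker times the remaining ones. The sketch below uses made-up clause scores and ignores normalisation and coordination factors, so it only illustrates the combination rule, not the numbers Elasticsearch would actually produce.

```python
def bool_should_score(clause_scores):
    """Sum of matching clause scores (simplified: no coord or query norm)."""
    return sum(s for s in clause_scores if s > 0)

def dis_max_score(clause_scores, tie_breaker=0.0):
    """Best clause score plus tie_breaker times the remaining clause scores."""
    matching = [s for s in clause_scores if s > 0]
    if not matching:
        return 0.0
    best = max(matching)
    return best + tie_breaker * (sum(matching) - best)

# Made-up scores for the phone, email and name clauses of two documents.
weak_everywhere = [0.5, 0.5, 0.5]
strong_in_one = [0.0, 0.0, 1.0]

print(bool_should_score(weak_everywhere), bool_should_score(strong_in_one))  # 1.5 1.0
print(dis_max_score(weak_everywhere), dis_max_score(strong_in_one))          # 0.5 1.0
```

With tie_breaker at 0.0, as in the query above, only the single best clause counts, so one strong match in one field outranks weak matches spread over several fields.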

Solution

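  • You probably don't want to apply the autocomplete filter (i.e. the name analyzer) to the search string, only at index time. In the Elasticsearch 1.x mapping syntax used in the question, that means declaring name as the index_analyzer rather than the analyzer (on 2.0 and later, where index_analyzer was removed, keep "analyzer" and add a "search_analyzer" that omits the autocomplete filter), i.e. the mapping should be: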

    "first_name": {
        "type":"string",
        "index_analyzer":"name"
    }
    

    Also, to score matches on first_name higher than matches on last_name in the multi_match, you could provide a field-level boost as follows.

    Example: a last_name match is half as relevant as a first_name match:

    {
        'dis_max': {
            'tie_breaker': 0.0,
            'boost': 1.5,
            'queries': [ // Any one of these matches will return a result
                {
                    'match': {
                        'phone': {
                            'query': $searchString,
                            'boost': 1.9
                        }
                    }
                },
                {
                    'match': {
                        'email': {
                            'query': $searchString
                        }
                    }
                },
                {
                    'multi_match': {
                        'query': $searchString,
                        'type': 'cross_fields', // Match if any term is in any of the fields
                        'fields': ['name.first_name', 'name.last_name^0.5'],
                        'tie_breaker': 0.1,
                        'boost': 1.5
                    }
                }
            ]
        }
    }
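
    The effect of the first point, analysing the search string without the autocomplete filter, can be sketched in Python. This is a simplified model, not Elasticsearch: tokenization is reduced to lowercasing and splitting on whitespace, and the helper names and sample first names are invented for illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=50):
    """Prefixes of a token, approximating the autocomplete filter."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]

def index_tokens(value):
    # Index side: edge n-grams are always applied.
    return {g for word in value.lower().split() for g in edge_ngrams(word)}

def search_tokens(query, apply_ngrams):
    # Search side: n-grams are applied only when the same analyzer is reused.
    words = query.lower().split()
    if apply_ngrams:
        return {g for word in words for g in edge_ngrams(word)}
    return set(words)

def matches(query, value, apply_ngrams):
    """True if any query token equals any indexed token (OR semantics)."""
    return bool(search_tokens(query, apply_ngrams) & index_tokens(value))

first_names = ["Savvas", "Sam", "Sarah"]  # hypothetical documents

same_analyzer = [n for n in first_names if matches("savvas", n, True)]
index_only = [n for n in first_names if matches("savvas", n, False)]
prefix_query = [n for n in first_names if matches("sav", n, False)]

print(same_analyzer)  # ['Savvas', 'Sam', 'Sarah']
print(index_only)     # ['Savvas']
print(prefix_query)   # ['Savvas']
```

    Prefix search still works after the change, because the indexed side keeps its n-grams; only the query stops being broken into short prefixes. The same reasoning would apply to last_name and email, whose analyzers also end in the autocomplete filter.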