
Elasticsearch edge n-gram not returning all expected results


I am having a hard time figuring out why an Elasticsearch query returns unexpected results. I indexed documents like the following into Elasticsearch.

{
"group": "J00-I99", "codes": [
   { "id": "J15", "description": "hello world" },
   { "id": "J15.0", "description": "test one world" },
   { "id": "J15.1", "description": "test two world J15.0" },
   { "id": "J15.2", "description": "test two three world J15" },
   { "id": "J15.3", "description": "hello world J18 " },
    ............................ // Similar records here
   { "id": "J15.9", "description": "hello world new" },
   { "id": "J16.0", "description": "new description" }
]
}

My aim is to implement autocomplete functionality, and for that I used the edge n-gram approach. I don't want to use the completion suggester.

Currently I am stuck with two issues:

  1. Search query (both id and description fields): J15

Expected result: all of the above records that include J15.
Actual result: only a few results are returned (J15.0, J15.1, J15.8).

  2. Search query (both id and description fields): test two

Expected result:

{ "id": "J15.1", "description": "test two world J15.0" },
{ "id": "J15.2", "description": "test two three world J15" },

Actual Result:

   { "id": "J15.0", "description": "test one world" },
   { "id": "J15.1", "description": "test two world J15.0" },
   { "id": "J15.2", "description": "test two three world J15" },

The mapping is defined like this:

           {

                settings: {
                    number_of_shards: 1,
                    analysis: {
                        filter: {
                            ngram_filter: {
                                type: 'edge_ngram',
                                min_gram: 2,
                                max_gram: 20
                            }
                        },
                        analyzer: {
                            ngram_analyzer: {
                                type: 'custom',
                                tokenizer: 'standard',
                                filter: [
                                    'lowercase', 'ngram_filter'
                                ]
                            }
                        }
                    }
                },
                mappings: {
                    properties: {
                        group: {
                            type: 'text'
                        },
                        codes: {
                            type: 'nested',
                            properties: {
                                id: {
                                    type: 'text',
                                    analyzer: 'ngram_analyzer',
                                    search_analyzer: 'standard'
                                },
                                description: {
                                    type: 'text',
                                    analyzer: 'ngram_analyzer',
                                    search_analyzer: 'standard'
                                }
                            }
                        }
                    }
                }
            }
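
To make the mapping concrete, here is a rough Python sketch of the tokens that ngram_analyzer would store at index time. It is only a simplified model: whitespace splitting stands in for the standard tokenizer, and the helper names edge_ngrams and index_tokens are made up for illustration, not part of Elasticsearch.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the prefixes of `token` with length min_gram..max_gram."""
    upper = min(len(token), max_gram)
    # Tokens shorter than min_gram produce no n-grams at all.
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_tokens(text):
    """Simplified stand-in for ngram_analyzer (assumption: whitespace split
    approximates the standard tokenizer for this sample data)."""
    tokens = []
    for word in text.split():
        tokens.extend(edge_ngrams(word.lower()))
    return tokens


print(index_tokens("test one world"))
# ['te', 'tes', 'test', 'on', 'one', 'wo', 'wor', 'worl', 'world']
```

Because search_analyzer is standard, the search text is not n-grammed: each whole search word is looked up against these stored prefixes.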

Search Query:

GET myindex/_search
{
  "_source": {
    "excludes": [
      "codes"
    ]
  },
  "query": {
    "nested": {
      "path": "codes",
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "codes.description": "J15"
              }
            },
            {
              "match": {
                "codes.id": "J15"
              }
            }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}

Note: Document index will be large in size. Here only sample data mentioned.

For the second issue, can I use multi_match with the AND operator, as shown below?

GET myindex/_search
{
  "_source": {
    "excludes": [
      "codes"
    ]
  },
  "query": {
    "nested": {
      "path": "codes",
      "query": {
        "bool": {
          "should": [
            {
              "multi_match": {
                    "query": "J15",
                    "fields": ["codes.id", "codes.description"],
                    "operator": "and"
                }
            }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}
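
One way to avoid hand-written JSON mistakes in a request like this is to build the body as a Python dict and serialize it, so the operator value always ends up as a quoted string. This is only a sketch: the function name nested_autocomplete_query is hypothetical, and the single-clause bool/should wrapper is dropped for brevity.

```python
import json


def nested_autocomplete_query(text, operator="and"):
    """Build the nested multi_match request body as a plain dict."""
    return {
        "_source": {"excludes": ["codes"]},
        "query": {
            "nested": {
                "path": "codes",
                "query": {
                    "multi_match": {
                        "query": text,
                        "fields": ["codes.id", "codes.description"],
                        # Must be a string; json.dumps adds the quotes.
                        "operator": operator,
                    }
                },
                "inner_hits": {},
            }
        },
    }


# Serialize to the JSON text that would be sent as the request body.
print(json.dumps(nested_autocomplete_query("test two"), indent=2))
```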

Any help would be really appreciated, as I am having a hard time fixing this.


Solution

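  • Adding another answer, as it's a different issue and the first answer focused on the first issue.

    The issue is that your second query, test two, also returns test one world. At index time you use ngram_analyzer, which is built on the standard tokenizer: it splits the text into words, and the edge n-gram filter then stores the prefixes of each word (up to max_gram characters, so here the whole word too). Your search analyzer is standard, and a match query combines the search terms with OR by default, so a record matches as soon as any one term is found. If you run the Analyze API on your indexed text and on your search term, you can see the token they share: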

    {
       "text" : "test one world",
       "analyzer" : "standard"
    }
    

    And generated tokens

    {
        "tokens": [
            {
                "token": "test",
                "start_offset": 0,
                "end_offset": 4,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "one",
                "start_offset": 5,
                "end_offset": 8,
                "type": "<ALPHANUM>",
                "position": 1
            },
            {
                "token": "world",
                "start_offset": 9,
                "end_offset": 14,
                "type": "<ALPHANUM>",
                "position": 2
            }
        ]
    }
    

    And for your search term test two

    {
        "tokens": [
            {
                "token": "test",
                "start_offset": 0,
                "end_offset": 4,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "two",
                "start_offset": 5,
                "end_offset": 8,
                "type": "<ALPHANUM>",
                "position": 1
            }
        ]
    }
    

    As you can see, the test token is present in your document, which is why that record appears in the search results. This can be solved by using the AND operator in the query, as shown below.

    Search query

    {
        "_source": {
            "excludes": [
                "codes"
            ]
        },
        "query": {
            "nested": {
                "path": "codes",
                "query": {
                    "bool": {
                        "must": {
                            "multi_match": {
                                "query": "test two",
                                "fields": [
                                    "codes.id",
                                    "codes.description"
                                ],
                                "operator" :"AND"
                            }
                        }
                    }
                },
                "inner_hits": {}
            }
        }
    }
    

    And search results

                                "hits": [
                                    {
                                        "_index": "myindexedge64170045",
                                        "_type": "_doc",
                                        "_id": "1",
                                        "_nested": {
                                            "field": "codes",
                                            "offset": 2
                                        },
                                        "_score": 2.6901608,
                                        "_source": {
                                            "id": "J15.1",
                                            "description": "test two world J15.0"
                                        }
                                    },
                                    {
                                        "_index": "myindexedge64170045",
                                        "_type": "_doc",
                                        "_id": "1",
                                        "_nested": {
                                            "field": "codes",
                                            "offset": 3
                                        },
                                        "_score": 2.561376,
                                        "_source": {
                                            "id": "J15.2",
                                            "description": "test two three world J15"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
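
The difference between the default OR behaviour and the AND operator can be sketched in plain Python. This is a simplified single-field model under stated assumptions, not how Elasticsearch scores or executes the query: whitespace splitting stands in for the standard tokenizer, and the helper names index_terms and matches are made up for illustration.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Prefixes of `token` with length min_gram..max_gram, as a set."""
    return {token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)}


def index_terms(text):
    """Index-time terms: lowercase words expanded into edge n-grams."""
    terms = set()
    for word in text.lower().split():
        terms |= edge_ngrams(word)
    return terms


def matches(query, text, operator="or"):
    """Search-time words are not n-grammed; each is looked up as-is."""
    indexed = index_terms(text)
    hits = [term in indexed for term in query.lower().split()]
    return all(hits) if operator == "and" else any(hits)


descriptions = [
    "hello world",
    "test one world",
    "test two world J15.0",
    "test two three world J15",
]

for op in ("or", "and"):
    print(op, [d for d in descriptions if matches("test two", d, op)])
# or ['test one world', 'test two world J15.0', 'test two three world J15']
# and ['test two world J15.0', 'test two three world J15']
```

With OR, the shared token test is enough to pull in test one world; with AND, both test and two have to be present in the same field, which leaves only the two expected records.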