Tags: elasticsearch, elasticsearch-java-api, elasticsearch-query

Why does my Elasticsearch multi-match query look only for prefixes?


I am trying to write an Elasticsearch multi-match query (with the Java API) to create a "search-as-you-type" program. The query is applied to two fields, title and description, which are analyzed as ngrams.

My problem is that Elasticsearch seems to match only words that begin with my query. For instance, if I search for "nut", it matches documents featuring "nut", "nuts", "Nutella", etc., but it does not match documents featuring "walnut", which should also match.

Here are my settings:

{
    "index": {
        "analysis": {
            "analyzer": {
                "edgeNGramAnalyzer": {
                    "tokenizer": "edgeTokenizer",
                    "filter": [
                        "word_delimiter",
                        "lowercase",
                        "unique"
                    ]
                }
            },
            "tokenizer": {
                "edgeTokenizer": {
                  "type": "edgeNGram",
                  "min_gram": "3",
                  "max_gram": "8",
                  "token_chars": [
                    "letter",
                    "digit"
                  ]
                }
            }
        }
    }
}

Here is the relevant part of my mapping:

{
    "content": {
        "properties": {
            "title": {
                "type": "text",
                "analyzer": "edgeNGramAnalyzer",
                "fields": {
                    "sort": { 
                        "type": "keyword"
                    }
                }
            },
            "description": {
                "type": "text",
                "analyzer": "edgeNGramAnalyzer",
                "fields": {
                    "sort": { 
                        "type": "keyword"
                    }
                }
            }
        }
    }
}

And here is my query:

new MultiMatchQueryBuilder(query).field("title", 3).field("description", 1).fuzziness(0).tieBreaker(1).minimumShouldMatch("100%")
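
For reference, this builder should produce roughly the following query DSL (a sketch reconstructed from the Java snippet above, with the same field boosts and parameters):

```json
{
    "multi_match": {
        "query": "nut",
        "fields": ["title^3", "description"],
        "fuzziness": 0,
        "tie_breaker": 1.0,
        "minimum_should_match": "100%"
    }
}
```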

Do you have any idea what I could be doing wrong?


Solution

  • That's because you're using an edgeNGram tokenizer instead of an nGram one. The former indexes only the prefixes of each token, while the latter indexes substrings taken from anywhere in the token: prefixes, suffixes, and inner sub-parts.

    Change your analyzer definition to this instead and it should work as expected:

    {
        "index": {
            "analysis": {
                "analyzer": {
                    "edgeNGramAnalyzer": {
                        "tokenizer": "edgeTokenizer",
                        "filter": [
                            "word_delimiter",
                            "lowercase",
                            "unique"
                        ]
                    }
                },
                "tokenizer": {
                    "edgeTokenizer": {
                      "type": "nGram",         <---- change this
                      "min_gram": "3",
                      "max_gram": "8",
                      "token_chars": [
                        "letter",
                        "digit"
                      ]
                    }
                }
            }
        }
    }
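
    To illustrate the difference, here is a small standalone sketch (not using Elasticsearch itself) of the tokens the two tokenizers emit for "walnut" with min_gram 3 and max_gram 8. The class and method names are made up for this demo:

    ```java
    import java.util.ArrayList;
    import java.util.List;

    public class NgramDemo {
        // nGram tokenizer: every substring of length min..max, from any position
        static List<String> ngrams(String text, int min, int max) {
            List<String> out = new ArrayList<>();
            String t = text.toLowerCase();
            for (int n = min; n <= max; n++)
                for (int i = 0; i + n <= t.length(); i++)
                    out.add(t.substring(i, i + n));
            return out;
        }

        // edgeNGram tokenizer: only prefixes of length min..max
        static List<String> edgeNgrams(String text, int min, int max) {
            List<String> out = new ArrayList<>();
            String t = text.toLowerCase();
            for (int n = min; n <= Math.min(max, t.length()); n++)
                out.add(t.substring(0, n));
            return out;
        }

        public static void main(String[] args) {
            // edgeNGram never emits "nut" for "walnut", so the query cannot match
            System.out.println(edgeNgrams("walnut", 3, 8)); // [wal, waln, walnu, walnut]
            // nGram emits "nut" (and "aln", "lnu", ...), so the query matches
            System.out.println(ngrams("walnut", 3, 8).contains("nut")); // true
        }
    }
    ```

    Note also that in recent Elasticsearch versions the tokenizer types are named ngram and edge_ngram; the camelCase forms nGram and edgeNGram are deprecated.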