Tags: elasticsearch, elasticsearch-java-api, elasticsearch-7

Match all exact words in a query


I want to create a query using the Elasticsearch Java API that only matches (1) complete words and (2) all of the words from the search query. Here is an example:

Text:

hello wonderful world

These should match:

hello
hello wonderful
hello world
wonderful world
hello wonderful world
wonderful
world

These shouldn't match:

hell
hello fniefsgbsugbs

I tried the following parameters for a match query, but it still matches both examples above.

This is the code that generates the query, using the Elasticsearch 7.7.1 Java API:

import org.elasticsearch.index.query.Operator;
import org.elasticsearch.index.query.QueryBuilders;
...

QueryBuilders.matchQuery(field, query)
            .autoGenerateSynonymsPhraseQuery(false)
            .fuzziness(0)
            .prefixLength(0)
            .fuzzyTranspositions(false)
            .operator(Operator.AND)
            .minimumShouldMatch("100%");

Which will generate this query:

{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "searchableText": {
              "query": "hell",
              "operator": "AND",
              "fuzziness": "0",
              "prefix_length": 0,
              "max_expansions": 50,
              "minimum_should_match": "100%",
              "fuzzy_transpositions": false,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": false,
              "boost": 1
            }
          }
        }
      ]
    }
  }
}

Can someone help me to find a good solution for this?

Edit: Here are the settings and mappings (I removed everything that isn't relevant to searchableText to keep it as minimal as possible):

{
    "settings": {
      "analysis": {
        "normalizer": {
          "lowercase_normalizer": {
            "type": "custom",
            "filter": [
              "lowercase"
            ]
          }
        },
        "filter": {
          "german_stemmer": {
            "type": "stemmer",
            "language": "light_german"
          },
          "ngram_filter": {
            "type": "shingle",
            "max_shingle_size": 4,
            "min_shingle_size": 2,
            "output_unigrams": false,
            "output_unigrams_if_no_shingles": false
          }
        },
        "analyzer": {
          "german": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "german_synonyms",
              "german_stop",
              "german_keywords",
              "german_no_stemming",
              "german_stemmer"
            ]
          },
          "german_ngram": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "german_synonyms",
              "german_keywords",
              "german_no_stemming",
              "german_stemmer",
              "ngram_filter"
            ]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "description": {
          "type": "text",
          "copy_to": "searchableText",
          "analyzer": "german"
        },
        "name": {
          "type": "text",
          "copy_to": "searchableText",
          "analyzer": "german"
        },
        "userTags": {
          "type": "keyword",
          "copy_to": "searchableText",
          "normalizer": "lowercase_normalizer"
        },
        "searchableText": {
          "type": "text",
          "analyzer": "german",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "german_ngram"
            }
          }
        },
        "searches": {
          "type": "keyword",
          "copy_to": "searchableText",
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }
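Since searchableText goes through a custom analyzer chain (and its ngram subfield emits only 2- to 4-word shingles), it can be useful to check what tokens these analyzers actually produce. A sketch using the _analyze API, assuming the index is named my_index (the real token output also depends on the german_synonyms, german_keywords, and german_no_stemming filter definitions, which are shown below):

```
GET /my_index/_analyze
{
  "analyzer": "german_ngram",
  "text": "hello wonderful world"
}
```

The response lists each emitted token with its position, which makes it easy to see the effect of the shingle and stemmer filters.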

Edit 2: These are the filters mentioned above:

"filter": {
    "german_stop": {
      "type": "stop",
      "stopwords": "_german_"
    },
    "german_stemmer": {
      "type": "stemmer",
      "language": "light_german"
    },
    "ngram_filter": {
      "type": "shingle",
      "max_shingle_size": 4,
      "min_shingle_size": 2,
      "output_unigrams": false,
      "output_unigrams_if_no_shingles": false
    }
}

Solution

  • I tried creating an index with your settings and mapping, but since the filters below were not provided, I got an error and created the index after removing them:

    "german_synonyms",
    "german_stop",
    "german_keywords",
    "german_no_stemming",
    

    After that I indexed your sample document hello wonderful world and used your search query, and it works as you expected: it didn't return a result for hell or hello fniefsgbsugbs, as shown below.

    {
        "size": 100,
        "query": {
            "bool": {
                "filter": [
                    {
                        "match": {
                            "searchableText": {
                                "query": "hello fniefsgbsugbs",
                                "operator": "AND",
                                "fuzziness": "0",
                                "prefix_length": 0,
                                "max_expansions": 50,
                                "minimum_should_match": "100%",
                                "fuzzy_transpositions": false,
                                "lenient": false,
                                "zero_terms_query": "NONE",
                                "auto_generate_synonyms_phrase_query": false,
                                "boost": 1
                            }
                        }
                    }
                ]
            }
        }
    }
    

    And it returns an empty result:

    "hits": {
            "total": {
                "value": 0,
                "relation": "eq"
            },
            "max_score": null,
            "hits": []
        }
    
    

    The same is true for hell, while it does return results for hello, hello wonderful, and the other terms that are expected to match.

    EDIT: You are using a match query, which is analyzed, i.e. it analyzes the search term with the same analyzer that was applied to the field at index time, and matches the search-time tokens against the index-time tokens.

    In order to debug these kinds of issues properly, use the analyze API and inspect the tokens of your indexed documents and your search terms.
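    For example, you can compare index-time and search-time tokens for the field like this (the index name my_index is an assumption; the field name follows the mapping in the question):

```
GET /my_index/_analyze
{
  "field": "searchableText",
  "text": "hello wonderful world"
}

GET /my_index/_analyze
{
  "field": "searchableText",
  "text": "hell"
}
```

    If a search-time token (e.g. hell) does not appear in the index-time token list, the match query with operator AND cannot match that document, which is exactly the behavior you want.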