elasticsearch wildcard tf-idf elasticsearch-query

Elasticsearch - search wildcards (contains in strings) and tf-idf scores


How can I run a wildcard search and still get tf-idf scores? For example, when I search like this,

GET /test_es/_search?explain=true // returns idf / tf scores
{
  "explain":true,
  "query": {
    "query_string": {
      "query": "bar^5",
      "fields"  : ["field"]
    }
  }
}

it returns the idf and tf scores, but not when I search with wildcards (contains):

GET /test_es/_search?explain=true  // does NOT return idf/tf scores
{
   "explain":true,
  "query": {
    "query_string": {
      "query": "b*",
      "fields"  : ["field"]
    }
  }
}

How can I search with wildcards (a contains-style match on the string) and still get the TF-IDF scores?

For example, I have 3 documents, "foo", "foo bar" and "foo baz". When I search like this:

GET /foo2/_search?explain=true
{
   "explain":true,
  "query": {
    "query_string": {
      "query": "fo *",
      "fields"  : ["field"]
    }
  }
}

Elasticsearch Result

"hits" : [
  {
    "_shard" : "[foo2][0]",
    "_node" : "z8bjI0T1T8Oq6Z2OwFyIKw",
    "_index" : "foo2",
    "_type" : "_doc",
    "_id" : "3",
    "_score" : 1.0,
    "_source" : {
      "field" : "foo bar"
    },
    "_explanation" : {
      "value" : 1.0,
      "description" : "sum of:",
      "details" : [
        {
          "value" : 1.0,
          "description" : "*:*",
          "details" : [ ]
        }
      ]
    }
  },
  {
    "_shard" : "[foo2][0]",
    "_node" : "z8bjI0T1T8Oq6Z2OwFyIKw",
    "_index" : "foo2",
    "_type" : "_doc",
    "_id" : "2",
    "_score" : 1.0,
    "_source" : {
      "field" : "foo"
    },
    "_explanation" : {
      "value" : 1.0,
      "description" : "sum of:",
      "details" : [
        {
          "value" : 1.0,
          "description" : "*:*",
          "details" : [ ]
        }
      ]
    }
  },
  {
    "_shard" : "[foo2][0]",
    "_node" : "z8bjI0T1T8Oq6Z2OwFyIKw",
    "_index" : "foo2",
    "_type" : "_doc",
    "_id" : "1",
    "_score" : 1.0,
    "_source" : {
      "field" : "foo baz"
    },
    "_explanation" : {
      "value" : 1.0,
      "description" : "sum of:",
      "details" : [
        {
          "value" : 1.0,
          "description" : "*:*",
          "details" : [ ]
        }
      ]
    }
  }
]

But I expected "foo" to be the first result, with the highest score, because it matches 100%. Am I wrong?


Solution

  • Update 2:

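    Wildcard queries are term-level queries. By default they use a constant-score rewrite method (the documentation describes the default as using the constant_score_boolean method when few terms match), so every matching document gets the same score and no tf/idf details appear in the explanation.

    Changing the value of the rewrite parameter affects both search performance and relevance. Besides the constant-score methods, it offers scoring methods such as scoring_boolean and top_terms_N, which calculate a relevance score for each matching document. Choose one according to your requirements.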
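
As a rough illustration of the rewrite parameter, the sketch below builds two _search request bodies as plain Python dictionaries, one for the wildcard query and one for query_string, each with rewrite set to scoring_boolean. The field name "field" and the pattern "fo*" are taken from the question; the helper names are hypothetical, and the printed JSON still has to be sent to the index with whatever client you use. Keep in mind that scoring rewrites expand the pattern into one term clause per matching term, so on a large index they can be slow or exceed the bool clause limit.

```python
import json


def wildcard_body(field, pattern, rewrite="scoring_boolean"):
    """Build a _search body for a term-level wildcard query (hypothetical helper)."""
    return {
        "explain": True,
        "query": {
            "wildcard": {
                # "rewrite" selects how the matching terms are turned into a query
                field: {"value": pattern, "rewrite": rewrite}
            }
        },
    }


def query_string_body(field, pattern, rewrite="scoring_boolean"):
    """Same idea for the query_string query used in the question (hypothetical helper)."""
    return {
        "explain": True,
        "query": {
            "query_string": {
                "query": pattern,
                "fields": [field],
                "rewrite": rewrite,
            }
        },
    }


if __name__ == "__main__":
    print(json.dumps(wildcard_body("field", "fo*"), indent=2))
    print(json.dumps(query_string_body("field", "fo*"), indent=2))
```

With a scoring rewrite, the explanation for each hit should contain per-term scoring details instead of a single constant value.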

    For your use case, however, you may also use an edge_ngram tokenizer. Edge n-grams are useful for search-as-you-type queries. To learn more about this and about the mapping used below, refer to the official documentation.

    Index Mapping:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "autocomplete": {
              "tokenizer": "autocomplete",
              "filter": [
                "lowercase"
              ]
            },
            "autocomplete_search": {
              "tokenizer": "lowercase"
            }
          },
          "tokenizer": {
            "autocomplete": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 10,
              "token_chars": [
                "letter"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "title": {
            "type": "text",
            "analyzer": "autocomplete",
            "search_analyzer": "autocomplete_search"
          }
        }
      }
    }
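
To see what this mapping puts into the index, the following sketch imitates the "autocomplete" analyzer in plain Python. It is only an approximation of the edge_ngram tokenizer with token_chars set to "letter" followed by the lowercase filter, not Elasticsearch's own implementation, and edge_ngrams is a hypothetical helper name.

```python
import re


def edge_ngrams(text, min_gram=2, max_gram=10):
    """Approximate the 'autocomplete' analyzer: split on non-letters,
    emit the prefixes of each word, then lowercase them."""
    tokens = []
    # [^\W\d_]+ matches runs of letters only (no digits, no underscore)
    for word in re.findall(r"[^\W\d_]+", text):
        longest = min(max_gram, len(word))
        for size in range(min_gram, longest + 1):
            tokens.append(word[:size].lower())
    return tokens


if __name__ == "__main__":
    for title in ["foo", "foo bar", "foo baz"]:
        print(title, "->", edge_ngrams(title))
    # foo -> ['fo', 'foo']
    # foo bar -> ['fo', 'foo', 'ba', 'bar']
    # foo baz -> ['fo', 'foo', 'ba', 'baz']
```

Every document ends up containing the token "fo", so the search term "fo" (left as-is by the autocomplete_search analyzer) matches all three through an ordinary, scored term match, and "foo" is the document with the fewest tokens.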
    

    Index sample data:

    { "title":"foo" }
    { "title":"foo bar" }
    { "title":"foo baz" }
    

    Search Query:

    {
      "query": {
        "match": {
          "title": {
            "query": "fo"
          }
        }
      }
    }
    

    Search Result:

    "hits": [
                {
                    "_index": "foo6",
                    "_type": "_doc",
                    "_id": "1",
                    "_score": 0.15965709,        --> Maximum score
                    "_source": {
                        "title": "foo"
                    }
                },
                {
                    "_index": "foo6",
                    "_type": "_doc",
                    "_id": "2",
                    "_score": 0.12343237,
                    "_source": {
                        "title": "foo bar"
                    }
                },
                {
                    "_index": "foo6",
                    "_type": "_doc",
                    "_id": "3",
                    "_score": 0.12343237,
                    "_source": {
                        "title": "foo baz"
                    }
                }
            ]
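
The ordering above can be checked by hand. The sketch below assumes the default BM25 similarity (k1 = 1.2, b = 0.75), a single shard holding only these three documents, one occurrence of the term "fo" per document, and field lengths of 2, 4 and 4 tokens (the edge n-grams each title produces). The leading factor k1 + 1 = 2.2 is what Elasticsearch 7.x reports as "boost" in its explain output; bm25_score is a hypothetical helper, not an Elasticsearch API.

```python
import math


def bm25_score(freq, doc_len, avg_len, doc_count, docs_with_term, k1=1.2, b=0.75):
    """BM25 score of one term in one document, in the form shown by explain in 7.x."""
    # idf: rarer terms score higher; here the term occurs in every document
    idf = math.log(1 + (doc_count - docs_with_term + 0.5) / (docs_with_term + 0.5))
    # tf: term frequency dampened by how long the field is relative to the average
    tf = freq / (freq + k1 * (1 - b + b * doc_len / avg_len))
    return (k1 + 1) * idf * tf


if __name__ == "__main__":
    lengths = {"foo": 2, "foo bar": 4, "foo baz": 4}  # tokens per title (assumed)
    avg_len = sum(lengths.values()) / len(lengths)
    for title, doc_len in lengths.items():
        score = bm25_score(1, doc_len, avg_len, doc_count=3, docs_with_term=3)
        print(f"{title!r}: {score:.6f}")
```

Under these assumptions the values come out at roughly 0.159657 for "foo" and 0.123432 for the two longer titles, in line with the scores in the result. The idf part is identical for all three documents; "foo" wins only because its field is shorter than average.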
    

    To learn more about the basics of using n-grams in Elasticsearch, you can refer to this.