
How does Elasticsearch ngram scoring work?


I have two documents in my index. One contains the field:

 name: foo bar

and the other contains:

 name: foo xyz bar xyz foo xyz bar xyz foo xyz bar xyz foo xyz bar

I'm using an ngram analyzer like this:

"analysis": {
  "analyzer": {
    "ngram_analyzer": {
      "tokenizer": "ngram_tokenizer"
    }
  },
  "tokenizer": {
    "ngram_tokenizer": {
      "type": "ngram",
      "min_gram": 3,
      "max_gram": 3,
      "token_chars": [
        "letter",
        "digit",
        "whitespace"
      ]
    }
  }
}

When I search for foo bar, the first document gets a higher score than the second. This is what I want, but can anybody explain how this scoring works? As far as I know, the ngram tokenizer splits the text into terms of 3 characters each, so how does it find out that foo and bar are adjacent in the first document and assign it a higher score?


Solution

  • Relevance scoring in Elasticsearch is not the easiest topic when you are starting out. The score calculation is based on three main parts:

    • Term frequency
    • Inverse document frequency
    • Field-length norm

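    In short:

    • the more often the term occurs in the field, the MORE relevant the document is
    • the more often the term occurs across the entire index, the LESS relevant the term is
    • the longer the field is, the LESS relevant a match in it is (a term found in a short field counts for more)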
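To make the three factors concrete, here is a minimal sketch in Python of how they combine for a single term. It assumes the simplified classic Lucene TF/IDF formulas; recent Elasticsearch versions default to BM25, which uses the same three ingredients with different formulas. The function names and the example numbers are hypothetical and chosen only for illustration.

```python
import math

def tf(freq):
    # Term frequency: more occurrences in the field raise the weight.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Inverse document frequency: terms found in many documents count less.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def field_norm(num_terms):
    # Field-length norm: the longer the field, the smaller the weight.
    return 1.0 / math.sqrt(num_terms)

def term_weight(freq, doc_freq, num_docs, num_terms):
    # Simplified per-term weight; the real formula has more factors.
    return tf(freq) * idf(doc_freq, num_docs) * field_norm(num_terms)

# Illustrative numbers: a term present in both of 2 documents,
# once in a 5-term field and four times in a 57-term field.
short_field = term_weight(freq=1, doc_freq=2, num_docs=2, num_terms=5)
long_field = term_weight(freq=4, doc_freq=2, num_docs=2, num_terms=57)
print(short_field > long_field)
```

With these numbers the short field wins even though the term occurs four times in the long one, because the field-length norm outweighs the extra term frequency.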

    I recommend reading the materials below:

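    Additionally, the score depends on the type of query you use. For example, a match query runs the search text foo bar through the analyzer and scores each document on the resulting terms. Those terms fit the foo bar document better than the second one, where the matching ngrams make up only a small part of a much longer field.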
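To see why adjacency matters even though the tokenizer knows nothing about word order, here is a rough Python sketch of the trigrams produced for the query and for both documents. It is a simulation, not the Elasticsearch tokenizer: the trigrams helper is hypothetical, and it assumes the query is analyzed with the same ngram_analyzer as the field. Because whitespace is listed in token_chars, each of these strings is treated as one unbroken run of characters, so some trigrams span the space between words.

```python
def trigrams(text, n=3):
    # Sliding window over the whole string; spaces are kept because
    # "whitespace" is one of the configured token_chars.
    return [text[i:i + n] for i in range(len(text) - n + 1)]

query = "foo bar"
doc1 = "foo bar"
doc2 = "foo xyz bar xyz foo xyz bar xyz foo xyz bar xyz foo xyz bar"

query_terms = set(trigrams(query))

# Query trigrams that each document does not contain.
missing_in_doc1 = query_terms - set(trigrams(doc1))
missing_in_doc2 = query_terms - set(trigrams(doc2))

print(sorted(query_terms))
print(missing_in_doc1)   # set()
print(missing_in_doc2)   # {'o b'}
print(len(trigrams(doc1)), len(trigrams(doc2)))   # 5 57
```

The trigram "o b" exists only where foo is directly followed by bar, so the first document matches all five query terms while the second matches four. The second document's field is also far longer (57 terms against 5), which lowers its field-length norm.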