
How to build an Elasticsearch query that takes into account the distance between words and the exactness of each word


I would like to implement an intranet site search with Elasticsearch, but I can't find a query that meets all my needs.

Here are the criteria I would like to apply when searching for two or more words:

  • the closer the words are to each other in the content text, the higher the score
  • an exact word match should score higher than a fuzzy match that has to replace letters in the word
  • a text that contains all the words should score higher than one that contains only one of them

Here's a demo of my search query that you can try online: https://www.found.no/play/gist/6df91cb4ed1f2b4b7328

When I search for "toll collector", I get the results in this order:

  1. Is your toll something connector wearing pants
  2. Is your toll connector wearing pants
  3. Is your toll collector wearing pants
  4. Is your toll doing something collector wearing pants
  5. Is your toll something collector wearing pants

But why is the exact match in third place rather than first? What I want is this result:

  1. Is your toll collector wearing pants
  2. Is your toll something collector wearing pants
  3. Is your toll doing something collector wearing pants
  4. Is your toll connector wearing pants
  5. Is your toll something connector wearing pants

Solution

  • Your query doesn't take word order into account.

    To do so, you need to add "type": "phrase" to your query. This does the same thing as replacing "match" with "match_phrase".

    You then get a single document, your desired #1.

    To allow in-between words, add "slop": 2.

    You then get the first three desired documents, in the right order. However, the "fuzziness" parameter seems to have no effect in phrase mode.

    To also get the "connector" answers, you can group the two queries in a "should" clause:

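    query:
        bool:
            should:
            # phrase match: terms in order, allowed to be up to 2 positions apart
            - match_phrase:
                description:
                    query: "toll collector"
                    slop: 2
            # fuzzy fallback: each term may differ by up to 2 edits
            - match:
                description:
                    query: "toll collector"
                    fuzziness: 2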
    

    This adds the "connector" answers, but their score does not take the in-between words into account.

    To do so, you would need some kind of distance score that encapsulates both phrase sloppiness and word fuzziness. I don't know whether this is implemented, but if it exists, it is going to be computationally expensive for order-2 edits on both sides.
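
To make the idea of such a combined distance concrete, here is a toy Python sketch that runs outside Elasticsearch. It is not how Elasticsearch scores documents, and nothing here is an Elasticsearch API: the helper names (edit_distance, combined_distance) and the edit_weight parameter are hypothetical, and the tokenizer is a plain whitespace split. It assumes the cost of a match is the number of in-between words plus a heavier penalty per character edit, so that any exact match outranks any fuzzy one.

```python
from itertools import product


def edit_distance(a, b):
    """Plain Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def combined_distance(query, text, max_edits=2, edit_weight=10):
    """Lowest cost over all in-order matches of the query terms.

    cost = edit_weight * (total character edits) + (in-between words)
    Returns None if some term has no match within max_edits.
    """
    terms = query.lower().split()
    tokens = text.lower().split()

    # For each query term, collect every (position, edits) candidate.
    candidates = []
    for term in terms:
        hits = []
        for pos, tok in enumerate(tokens):
            d = edit_distance(term, tok)
            if d <= max_edits:
                hits.append((pos, d))
        if not hits:
            return None
        candidates.append(hits)

    # Brute force over all combinations: fine for a toy, but this is
    # exactly the part that gets expensive on a real index.
    best = None
    for combo in product(*candidates):
        positions = [pos for pos, _ in combo]
        if any(p >= q for p, q in zip(positions, positions[1:])):
            continue  # terms must appear in query order
        gaps = positions[-1] - positions[0] - (len(terms) - 1)
        edits = sum(d for _, d in combo)
        cost = edit_weight * edits + gaps
        if best is None or cost < best:
            best = cost
    return best


# The five documents, in the order the original query returned them.
docs = [
    "Is your toll something connector wearing pants",
    "Is your toll connector wearing pants",
    "Is your toll collector wearing pants",
    "Is your toll doing something collector wearing pants",
    "Is your toll something collector wearing pants",
]

# Lower cost means a better match, so sort ascending.
ranked = sorted(docs, key=lambda d: combined_distance("toll collector", d))
for doc in ranked:
    print(combined_distance("toll collector", doc), doc)
```

With edit_weight larger than any gap that can occur, the documents come out in the order asked for in the question. The brute-force loop over candidate combinations also shows why such a score would be costly to compute at scale.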