Tags: elasticsearch, elastic-stack, shingles

Tokenize input text using shingles and compare them with document keywords


We have an index. It has a keyword field. A document may have keywords such as: ['cheesecake', 'cinnamon roll'].

If the input text contains the word 'cheesecake' there is no problem. But if the input text is something like 'Today I have eaten a cinnamon roll', there is no match. We think the problem is that the input text is tokenized into single words, so neither 'cinnamon' nor 'roll' matches our keyword 'cinnamon roll' (and we don't want them to! Only the full phrase 'cinnamon roll' should match the keyword 'cinnamon roll').
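
We can see the behaviour with the _analyze API: the standard analyzer emits only single-word tokens (today, i, have, eaten, a, cinnamon, roll), none of which is equal to the whole keyword. A quick check:

POST /_analyze
{
  "analyzer": "standard",
  "text": "Today I have eaten a cinnamon roll"
}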

How could we solve this? We thought of using shingles, but we couldn't find the proper way to set them up. And it is only the input search text that we need to tokenize.

This is our current query:

GET /food-suggestion/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "keywords": {
              "query": "cinnamon roll",
              "analyzer": "standard",
              "operator": "or"
            }
          }
        }
      ],
      "filter": [
        {
          "term": {
            "languageId": 1
          }
        },
        {
          "term": {
            "webId": 2
          }
        }
      ]
    }
  }
}

Index mapping:

  • description: Text
  • id: Integer
  • keywords: Keyword
  • languageId: Integer
  • foodId: Long
  • title: Text
  • webId: Integer
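
Roughly, this corresponds to a mapping like the following (a sketch based only on the field list above; everything else is left at the defaults):

PUT /food-suggestion
{
  "mappings": {
    "properties": {
      "description": { "type": "text" },
      "id": { "type": "integer" },
      "keywords": { "type": "keyword" },
      "languageId": { "type": "integer" },
      "foodId": { "type": "long" },
      "title": { "type": "text" },
      "webId": { "type": "integer" }
    }
  }
}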

This is a sample document from the index:

{
  "description": "Bla bla bla",
  "keywords": [
    "cinnamon roll",
    "crema catalana",
    "cheesecake"
  ],
  "languageId": 1,
  "foodId": 13,
  "title": "Sample title",
  "webId": 2
}

Solution

  • You are thinking in the right direction. You can use a shingle token filter to solve this issue.

    You can create the analyzer as shown below. Note that min_shingle_size and max_shingle_size are options of the shingle token filter, not of the analyzer itself, so they go on a custom filter. min_shingle_size cannot be lower than 2 (single-word keywords such as 'cheesecake' still work because the filter also emits unigrams by default), and raising max_shingle_size more than 3 above the minimum additionally requires increasing the index.max_shingle_diff index setting.

    PUT test/_settings
    {
      "settings": {
        "analysis": {
          "filter": {
            "keyword_shingle": {
              "type": "shingle",
              "min_shingle_size": 2,
              "max_shingle_size": 5
            }
          },
          "analyzer": {
            "standard_shingle": {
              "tokenizer": "standard",
              "filter": [
                "lowercase",
                "keyword_shingle"
              ]
            }
          }
        }
      }
    }
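
    You can sanity-check the analyzer with the _analyze API; with the settings above, the output should contain, among the single words and other shingles, the token 'cinnamon roll':

    GET test/_analyze
    {
      "analyzer": "standard_shingle",
      "text": "Today I have eaten a cinnamon roll"
    }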
    

    You can use the query below to get the desired result:

    GET test/_search
    {
      "query": {
        "bool": {
          "must": [
            {
              "match": {
                "keywords": {
                  "query": "cinnamon roll",
                  "analyzer": "standard_shingle"
                }
              }
            }
          ],
          "filter": [
            {
              "term": {
                "languageId": 1
              }
            },
            {
              "term": {
                "webId": 2
              }
            }
          ]
        }
      }
    }
    

    The above query will not return the document if you pass just 'cinnamon' or 'roll' on its own as the query text.

    Below are a few things to consider:

    1. Make sure the data in your keywords field is lower case, because keyword fields are case-sensitive.
    2. There might be corner cases for which this does not work; you can find them based on your use case by testing with sample data.
    3. If you need to add the analyzer to an existing index, first close the index, update the settings as shown above, and then open the index again (see the sketch after this list).
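
    For point 3, a minimal sketch of that sequence (assuming the index is called test as above; run the PUT test/_settings request shown earlier between the two calls):

    POST /test/_close

    POST /test/_open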