elasticsearch, kibana, elasticsearch-5

Counting occurrences of search terms in Elasticsearch function score script


I have an Elasticsearch index with a document structure like the one below.

{
  "id": "foo",
  "tags": ["Tag1", "Tag2", "Tag3"],
  "special_tags": ["SpecialTag1", "SpecialTag2", "SpecialTag3"],
  "reserved_tags": ["ReservedTag1", "ReservedTag2", "Tag1", "SpecialTag2"],
  // rest of the document
}

The fields tags, special_tags, and reserved_tags are stored separately because they serve multiple use cases. In one of my queries, I want to order the documents by the number of searched tags that occur in any of the three fields. For example, if I search with the three tags Tag1, Tag4, and SpecialTag3, two of them (Tag1 and SpecialTag3) occur in the document above, so the count is 2. I want to derive a custom score from this number and sort by that score.

I am already using function_score because the scoring depends on a few other attributes as well. To compute the number of matched tags, I tried a Painless script like the one below.

def matchedTags = 0;
def searchedTags = ["Tag1", "Tag4", "SpecialTag3"];
for (int i = 0; i < searchedTags.length; ++i) {
    if (doc['tags'].contains(searchedTags[i])) {
        matchedTags++;
        continue;
    }
    if (doc['special_tags'].contains(searchedTags[i])) {
        matchedTags++;
        continue;
    }
    if (doc['reserved_tags'].contains(searchedTags[i])) {
        matchedTags++;
    }
}
// logic to score on matchedTags (returning matchedTags for simplicity)
return matchedTags;

This runs as expected, but it is extremely slow. I assume that ES has to count the occurrences for each document and cannot use its indexes here. (If someone can shed light on how this works internally, or point to documentation or other resources, that would be helpful.)

I want to have two scoring functions.

  1. Score as a function of number of occurrences
  2. Score higher for a higher number of occurrences. This is basically the same as 1, but repeated occurrences are counted.
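To make the difference between the two counts concrete, here is a minimal plain-Java sketch that runs outside Elasticsearch, using the sample document and search tags from above. The class and method names (TagMatchCounts, distinctMatches, totalOccurrences) are made up for illustration, and it assumes a tag is counted at most once per field.

```java
import java.util.List;

public class TagMatchCounts {
    // Count 1: number of searched tags found in at least one field.
    // Mirrors the `continue` logic of the Painless script above.
    public static int distinctMatches(List<String> searched, List<List<String>> fields) {
        int count = 0;
        for (String tag : searched) {
            for (List<String> field : fields) {
                if (field.contains(tag)) {
                    count++;
                    break; // stop at the first field that contains the tag
                }
            }
        }
        return count;
    }

    // Count 2: every field that contains a searched tag adds one,
    // so a tag present in two fields contributes 2.
    public static int totalOccurrences(List<String> searched, List<List<String>> fields) {
        int count = 0;
        for (String tag : searched) {
            for (List<String> field : fields) {
                if (field.contains(tag)) {
                    count++;
                }
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<List<String>> fields = List.of(
            List.of("Tag1", "Tag2", "Tag3"),
            List.of("SpecialTag1", "SpecialTag2", "SpecialTag3"),
            List.of("ReservedTag1", "ReservedTag2", "Tag1", "SpecialTag2"));
        List<String> searched = List.of("Tag1", "Tag4", "SpecialTag3");

        System.out.println(distinctMatches(searched, fields));  // prints 2
        System.out.println(totalOccurrences(searched, fields)); // prints 3
    }
}
```

Tag1 appears in both tags and reserved_tags, which is why the second count is 3 while the first is 2.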

Is there a way to get the benefits of both fast searching and custom scoring with a script?

Any help is appreciated. Thanks.


Solution

  • We solved this using bitsets. For each document we build a bitset in which the bit is set for every tag the document has in any field (tags, special_tags, etc.) and clear for every other tag. It is stored as one big integer, so it acts as a condensed representation of all the tags in one document.

    The application knows which bit corresponds to which tag. To count the matched tags, we build a second bitset with the bit set for every searched tag. Then, in the Painless script, we cast both bitsets to big integers, take their bitwise AND, and count the number of set bits.
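The arithmetic behind this approach can be sketched in plain Java with java.math.BigInteger. This is only an illustration under assumptions, not the answer's actual implementation: the BIT_INDEX mapping, the class name TagBitsets, and its method names are hypothetical, and how the integer is stored in the index and passed to the script is not specified above.

```java
import java.math.BigInteger;
import java.util.List;
import java.util.Map;

public class TagBitsets {
    // Hypothetical tag-to-bit mapping; in the answer, the application owns this.
    static final Map<String, Integer> BIT_INDEX = Map.of(
        "Tag1", 0, "Tag2", 1, "Tag3", 2, "Tag4", 3,
        "SpecialTag1", 4, "SpecialTag2", 5, "SpecialTag3", 6,
        "ReservedTag1", 7, "ReservedTag2", 8);

    // Build a bitset with one bit set per known tag; duplicates set the same bit.
    public static BigInteger toBitset(List<String> tags) {
        BigInteger bits = BigInteger.ZERO;
        for (String tag : tags) {
            Integer index = BIT_INDEX.get(tag);
            if (index != null) { // tags without a mapping are ignored in this sketch
                bits = bits.setBit(index);
            }
        }
        return bits;
    }

    // Bitwise AND keeps only the bits set in both bitsets; bitCount counts them.
    public static int countMatches(BigInteger docBits, BigInteger queryBits) {
        return docBits.and(queryBits).bitCount();
    }

    public static void main(String[] args) {
        // All tags of the sample document, from all three fields.
        BigInteger docBits = toBitset(List.of(
            "Tag1", "Tag2", "Tag3",
            "SpecialTag1", "SpecialTag2", "SpecialTag3",
            "ReservedTag1", "ReservedTag2", "Tag1", "SpecialTag2"));
        BigInteger queryBits = toBitset(List.of("Tag1", "Tag4", "SpecialTag3"));

        System.out.println(countMatches(docBits, queryBits)); // prints 2
    }
}
```

Because each tag maps to a single bit, this yields the number of distinct matched tags (scoring function 1 in the question). Counting repeated occurrences across fields would presumably need one bitset per field.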