I want OpenSearch results sorted in descending order by the number of keyword occurrences in each document. _score isn't what I am looking for: it uses the BM25 algorithm, which also weighs document length and term rarity instead of simply counting matches.
Example: I am searching for 2 terms, 'happy' and 'cat'.
What I have: documents sorted by _score, which is not what I want, because a long text with 5 keywords is ranked lower than a short document with 2 keywords.
What I want: I want the long document with 5 keywords to be at the top and the document with 2 keywords to be below.
My solution now: I have a Java solution for it, but it creates a bottleneck for the API. I am basically counting the keywords in each document and then sorting the documents by that count. That takes 2 parallel streams, and I'm still blocking the API for way too long. I am looking for a 'pure' OpenSearch solution.
I found the solution myself. I had to change the default Okapi BM25 similarity settings to the ones below. This has to be done when creating the index.
BM25 params:
k1 - a tuning parameter (usually set between 1.2 and 2.0) that controls term frequency saturation. Increasing k1 makes the score saturate more slowly, so repeated occurrences of a term keep adding to the score for longer.
b - another tuning parameter (usually set around 0.75) that controls length normalization. A value of 1.0 applies full document length normalization, while 0.0 disables it.
{
  "settings": {
    "index": {
      "similarity": {
        "default": {
          "type": "BM25",
          "k1": 2.0,
          "b": 0.0
        }
      }
    }
  }
}
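To see why these values help, here is a minimal sketch of the textbook BM25 term-frequency factor for a single term (IDF left out). It is only an illustration, not OpenSearch's actual implementation; the function name, the document lengths, and the average length are made-up assumptions.

```python
def bm25_tf_factor(tf, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Textbook BM25 term-frequency factor for one term (IDF omitted).

    tf          - how many times the term occurs in the document
    doc_len     - document length in tokens
    avg_doc_len - average document length in the index
    """
    # b scales how strongly the document length influences the score.
    length_norm = 1.0 - b + b * (doc_len / avg_doc_len)
    # k1 controls how quickly extra occurrences stop adding to the score.
    return tf * (k1 + 1.0) / (tf + k1 * length_norm)


# Hypothetical documents: a short one with 2 hits, a long one with 5 hits.
AVG_LEN = 100
short_default = bm25_tf_factor(tf=2, doc_len=20, avg_doc_len=AVG_LEN)
long_default = bm25_tf_factor(tf=5, doc_len=400, avg_doc_len=AVG_LEN)

short_tuned = bm25_tf_factor(tf=2, doc_len=20, avg_doc_len=AVG_LEN, k1=2.0, b=0.0)
long_tuned = bm25_tf_factor(tf=5, doc_len=400, avg_doc_len=AVG_LEN, k1=2.0, b=0.0)

print("default settings, short ranks first:", short_default > long_default)
print("k1=2.0, b=0.0, long ranks first:", long_tuned > short_tuned)
```

With b set to 0.0 the document length drops out, so for a single term more occurrences always give a higher factor. It still saturates, though, so the score follows the keyword count in order but is not an exact count.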