I am looking into the different similarity algorithms which define how the score of each document is computed during search. The available algorithms are listed here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/index-modules-similarity.html
My problem is that I have problems to understand them when digging through the wikipedia articles or the class descriptions in the lucene API documentation. I really like the answer about explaining the TF/IDF similarity algorithm (the default in ElasticSearch) here: What is the reasoning behind the ranking of this ElasticSearch query? (so this one I understand to a certain amount).
Can somebody provide similiar simple explanations to the other algorithms outlined there? These include:
Thank you in advance.
The problem you run into here, is by the description set forward in the linked answer, Lucene's default similarity, and bm25 are fundamentally identical, in that they both factor in:
dfr
actually encompasses 7 different base-models alone, each using a different scoring algorithm, followed by two highly configurable normalization steps. A number of configuration options fit the very general steps above, some diverge from it.
Similarly, ib
allows some significant configuration as well, but generally hits the same high points, of favoring higher term frequency, favoring matches on terms that are more rare (by some description), and adjusting for document length, boosts, and other possible normalizations.