Tags: indexing, elasticsearch, lucene, comparison, morelikethis

How does Elasticsearch's more_like_this work (against the whole index)?


As I understand it, we first get a list of term vectors containing all the tokens, and from those we build a map<token, frequency in the document>. The createQueue method then scores each token: it drops stop words and words that do not occur often enough, computes the idf, and multiplies the idf by the token's frequency in the document to get its score. We then keep the 25 best tokens. But how does it work after that? How are they compared against the whole index? I read http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ but it did not explain this, or I missed the point.


Solution

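  • It creates a TermQuery out of each of those terms and adds them all to a simple BooleanQuery as SHOULD clauses. When boosting is enabled, each term is boosted by its previously calculated tf-idf score (boostFactor * myScore / bestScore, where boostFactor can be set by the user). That BooleanQuery is then run against the index like any other query, so documents that contain more of the selected terms, and more heavily boosted ones, score higher.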

    Here is the relevant source from Lucene's MoreLikeThis class (version 5.0):

    private Query createQuery(PriorityQueue<ScoreTerm> q) {
      BooleanQuery query = new BooleanQuery();
      ScoreTerm scoreTerm;
      float bestScore = -1;
    
      while ((scoreTerm = q.pop()) != null) {
        TermQuery tq = new TermQuery(new Term(scoreTerm.topField, scoreTerm.word));
    
        if (boost) {
          if (bestScore == -1) {
            bestScore = (scoreTerm.score);
          }
          float myScore = (scoreTerm.score);
          tq.setBoost(boostFactor * myScore / bestScore);
        }
    
        try {
          query.add(tq, BooleanClause.Occur.SHOULD);
        }
        catch (BooleanQuery.TooManyClauses ignore) {
          break;
        }
      }
      return query;
    }
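
To see how the boosts end up ranking other documents, here is a minimal plain-Java sketch that does not use Lucene at all. The class `MltSketch` and its methods `boosts` and `score` are hypothetical names made up for illustration. The boost arithmetic follows the `createQuery` source above, but the scoring is a deliberately simplified assumption: it just sums the boosts of the query terms a document contains, whereas Lucene also applies its own similarity model to each matching clause.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class MltSketch {

    /**
     * Mirrors the boost arithmetic in createQuery: the first term seen sets
     * bestScore, and every term gets boostFactor * score / bestScore.
     * termScores is a hypothetical stand-in for the queue of ScoreTerm objects;
     * it is iterated in insertion order.
     */
    public static Map<String, Float> boosts(Map<String, Float> termScores, float boostFactor) {
        Map<String, Float> result = new LinkedHashMap<>();
        float bestScore = -1;
        for (Map.Entry<String, Float> e : termScores.entrySet()) {
            if (bestScore == -1) {
                bestScore = e.getValue();
            }
            result.put(e.getKey(), boostFactor * e.getValue() / bestScore);
        }
        return result;
    }

    /**
     * Simplified assumption (not Lucene's real scoring): every term is an
     * optional clause, and a document's score is the sum of the boosts of
     * the query terms it contains.
     */
    public static float score(Map<String, Float> boosts, Set<String> docTerms) {
        float total = 0f;
        for (Map.Entry<String, Float> e : boosts.entrySet()) {
            if (docTerms.contains(e.getKey())) {
                total += e.getValue();
            }
        }
        return total;
    }

    public static void main(String[] args) {
        // Made-up tf-idf scores for three terms selected from the source document.
        Map<String, Float> termScores = new LinkedHashMap<>();
        termScores.put("lucene", 8.0f);
        termScores.put("index", 4.0f);
        termScores.put("query", 2.0f);

        Map<String, Float> b = boosts(termScores, 1.0f);
        System.out.println(b); // {lucene=1.0, index=0.5, query=0.25}

        // Two candidate documents from the "index", reduced to their term sets.
        System.out.println(score(b, Set.of("lucene", "query", "java"))); // 1.25
        System.out.println(score(b, Set.of("index")));                   // 0.5
    }
}
```

A document does not need to contain every selected term to match; under this simplification it only needs one, and each additional matching term raises its score by that term's boost.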