So first we get a list of termVectors, which contain all the tokens, and then we create a map<token, frequency in the document>.
Then the method createQueue determines a score for each token: it drops stop words and words that do not occur often enough, computes the idf, and then idf * doc_frequency for a given token, which is its score. We then keep the 25 best ones. But how does it work after that? How is it compared to the whole index? I read http://cephas.net/blog/2008/03/30/how-morelikethis-works-in-lucene/ but that didn't explain it, or I missed the point.
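It creates a TermQuery out of each of those terms and puts them all into a simple BooleanQuery, boosting each term by its previously calculated tf-idf score (boostFactor * myScore / bestScore, where boostFactor can be set by the user). That query is then run against the index like any other query, which is where the comparison with the other documents happens.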
Here is the source (version 5.0):
    private Query createQuery(PriorityQueue<ScoreTerm> q) {
        BooleanQuery query = new BooleanQuery();
        ScoreTerm scoreTerm;
        float bestScore = -1;
        while ((scoreTerm = q.pop()) != null) {
            TermQuery tq = new TermQuery(new Term(scoreTerm.topField, scoreTerm.word));
            if (boost) {
                if (bestScore == -1) {
                    bestScore = (scoreTerm.score);
                }
                float myScore = (scoreTerm.score);
                tq.setBoost(boostFactor * myScore / bestScore);
            }
            try {
                query.add(tq, BooleanClause.Occur.SHOULD);
            }
            catch (BooleanQuery.TooManyClauses ignore) {
                break;
            }
        }
        return query;
    }
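
To make the boost arithmetic concrete, here is a minimal sketch in plain Java with no Lucene dependency. BoostSketch, ScoredTerm and boosts are hypothetical names made up for this illustration, and the sketch assumes the terms arrive best score first, as the bestScore variable in the source suggests.

```java
import java.util.ArrayList;
import java.util.List;

class BoostSketch {
    // Hypothetical stand-in for Lucene's ScoreTerm: a word plus its tf-idf score.
    static final class ScoredTerm {
        final String word;
        final float score;

        ScoredTerm(String word, float score) {
            this.word = word;
            this.score = score;
        }
    }

    // Same arithmetic as createQuery: the first term seen (assumed to be the
    // best one) fixes bestScore, and every boost is scaled relative to it.
    static List<Float> boosts(List<ScoredTerm> termsBestFirst, float boostFactor) {
        List<Float> result = new ArrayList<>();
        float bestScore = -1;
        for (ScoredTerm term : termsBestFirst) {
            if (bestScore == -1) {
                bestScore = term.score;
            }
            result.add(boostFactor * term.score / bestScore);
        }
        return result;
    }

    public static void main(String[] args) {
        List<ScoredTerm> terms = new ArrayList<>();
        terms.add(new ScoredTerm("lucene", 8.0f));
        terms.add(new ScoredTerm("index", 4.0f));
        terms.add(new ScoredTerm("query", 2.0f));

        List<Float> boosts = boosts(terms, 1.0f);
        for (int i = 0; i < terms.size(); i++) {
            // Prints lucene^1.0, index^0.5, query^0.25 (one per line)
            System.out.println(terms.get(i).word + "^" + boosts.get(i));
        }
    }
}
```

With boostFactor left at 1, the best term gets a boost of 1 and every other term gets a fraction of that, so a document that matches the highest scoring terms contributes more to the final BooleanQuery score than one that only matches the weaker ones.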