I've seen this or a similar question many times on Stack Overflow and in other online sources. However, the corresponding part of Lucene's API seems to have changed quite a lot, so to sum it up: I did not find any example that works with the latest Lucene version.
What I have:
What I want: for every term that occurs in at least one of the selected documents, I want to get its TF-IDF value for each document. Put differently: for any term that occurs in any of the selected documents, I want its TF-IDF values, e.g., as an array (i.e., one TF-IDF value per selected document).
Any help is highly appreciated! :-)
Here's what I've come up with so far, but there are 2 problems:
    public void getTfidf(IndexReader reader, Writer out, String field) throws IOException {
        Bits liveDocs = MultiFields.getLiveDocs(reader);
        TermsEnum termEnum = MultiFields.getTerms(reader, field).iterator(null);
        BytesRef term = null;
        TFIDFSimilarity tfidfSim = new DefaultSimilarity();
        int docCount = reader.numDocs();
        while ((term = termEnum.next()) != null) {
            String termText = term.utf8ToString();
            Term termInstance = new Term(field, term);
            // term and doc frequency in all documents
            long indexTf = reader.totalTermFreq(termInstance);
            long indexDf = reader.docFreq(termInstance);
            double tfidf = tfidfSim.tf(indexTf) * tfidfSim.idf(docCount, indexDf);
            // store it, but that's not the problem
        }
    }
totalTermFreq does what it sounds like: it provides the term's frequency across the entire index. The TF in the calculation should be the term frequency within the document, not across the entire index. That's why everything you get here is constant: all of your variables are constant across the entire index, and none depend on the document. To get the term frequency for a document, you should use DocsEnum.freq(). Perhaps something like:
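    while ((term = termEnum.next()) != null) {
        Term termInstance = new Term(field, term);
        long indexDf = reader.docFreq(termInstance);
        // liveDocs as obtained in the question via MultiFields.getLiveDocs(reader)
        DocsEnum docs = termEnum.docs(liveDocs, null);
        while (docs.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
            // docs.freq() is the term frequency within the current document (docs.docID());
            // idf expects its arguments in the order (docFreq, numDocs)
            double tfidf = tfidfSim.tf(docs.freq()) * tfidfSim.idf(indexDf, docCount);
            // store it
        }
    }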
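
If it helps to see the per-document computation without setting up an index, here is a minimal, Lucene-free sketch in plain Java. It assumes the classic DefaultSimilarity formulas, tf = sqrt(freq) and idf = 1 + ln(numDocs / (docFreq + 1)), and runs over a tiny in-memory corpus. The names TfIdfSketch, sampleDocs and tfidf are made up for illustration and are not part of Lucene. The result has the shape asked for in the question: one array per term, with one TF-IDF value per document.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of Lucene.
class TfIdfSketch {

    // Assumed to mirror DefaultSimilarity.tf: square root of the in-document frequency.
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed to mirror DefaultSimilarity.idf(docFreq, numDocs).
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Toy corpus: each inner list is one already-tokenized document.
    static List<List<String>> sampleDocs() {
        return Arrays.asList(
                Arrays.asList("lucene", "index", "lucene"),
                Arrays.asList("index", "search"),
                Arrays.asList("search", "search", "search", "search"));
    }

    // Returns term -> one TF-IDF value per document (0.0 where the term is absent).
    static Map<String, double[]> tfidf(List<List<String>> docs) {
        int numDocs = docs.size();
        List<Map<String, Integer>> freqsPerDoc = new ArrayList<>();
        Map<String, Integer> docFreq = new TreeMap<>();
        for (List<String> doc : docs) {
            // Term frequency within this single document.
            Map<String, Integer> freqs = new HashMap<>();
            for (String term : doc) {
                freqs.merge(term, 1, Integer::sum);
            }
            freqsPerDoc.add(freqs);
            // Document frequency: count each term once per document.
            for (String term : freqs.keySet()) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        Map<String, double[]> result = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : docFreq.entrySet()) {
            double termIdf = idf(entry.getValue(), numDocs); // constant per term
            double[] row = new double[numDocs];
            for (int d = 0; d < numDocs; d++) {
                Integer freq = freqsPerDoc.get(d).get(entry.getKey());
                row[d] = (freq == null) ? 0.0 : tf(freq) * termIdf; // varies per document
            }
            result.put(entry.getKey(), row);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, double[]> entry : tfidf(sampleDocs()).entrySet()) {
            System.out.println(entry.getKey() + " -> " + Arrays.toString(entry.getValue()));
        }
    }
}
```

Only the tf factor changes from document to document, while idf is fixed per term. With this toy corpus, "search" gets [0.0, 1.0, 2.0], because its idf works out to exactly 1 and its in-document frequencies are 0, 1 and 4.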