Lucene 4.9: Get TF-IDF for a few selected documents from an Index


I've seen this or a similar question many times on Stack Overflow and in other online sources. However, the corresponding part of Lucene's API seems to have changed quite a lot, so to sum it up: I did not find any example that works with the latest Lucene version.

What I have:

  • Lucene Index + IndexReader + IndexSearcher
  • a bunch of documents (and their IDs)

What I want: For all terms that occur in at least one of the selected documents, I want to get the TF-IDF for each document. Or, to put it differently: for any term that occurs in any of the selected documents, I want its TF-IDF values, e.g., as an array (i.e., one TF-IDF value for each of the selected documents).

Any help is highly appreciated! :-)

Here's what I've come up with so far, but there are 2 problems:

  1. It uses a temporarily created RAMDirectory which contains only the selected documents. Is there any way to work on the original index, or does that not make sense?
  2. It does not compute a document-based TF-IDF but only an index-based one, i.e., over all documents. This means that for each term I get only one TF-IDF value instead of one per document and term.

public void getTfidf(IndexReader reader, Writer out, String field) throws IOException {

    Bits liveDocs = MultiFields.getLiveDocs(reader);
    TermsEnum termEnum = MultiFields.getTerms(reader, field).iterator(null);
    BytesRef term = null;
    TFIDFSimilarity tfidfSim = new DefaultSimilarity();
    int docCount = reader.numDocs();

    while ((term = termEnum.next()) != null) {
        String termText = term.utf8ToString();
        Term termInstance = new Term(field, term);
        // term and doc frequency in all documents
        long indexTf = reader.totalTermFreq(termInstance);
        long indexDf = reader.docFreq(termInstance);
        // idf takes (docFreq, numDocs), in that order
        double tfidf = tfidfSim.tf(indexTf) * tfidfSim.idf(indexDf, docCount);
        // store it, but that's not the problem
    }
}

Solution

  • totalTermFreq does what it sounds like: it provides the term's frequency across the entire index. The TF in the calculation should be the term frequency within the document, not across the entire index. That's why everything you get here is constant: all of your variables are constant across the entire index, and none depends on the document. To get the term frequency for a document, you should use DocsEnum.freq(). Perhaps something like:

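    while ((term = termEnum.next()) != null) {
        Term termInstance = new Term(field, term);
        long indexDf = reader.docFreq(termInstance);

        // liveDocs comes from MultiFields.getLiveDocs(reader), as in the question
        DocsEnum docs = termEnum.docs(liveDocs, null);
        while (docs.nextDoc() != DocsEnum.NO_MORE_DOCS) {
            // per-document term frequency; idf takes (docFreq, numDocs)
            double tfidf = tfidfSim.tf(docs.freq()) * tfidfSim.idf(indexDf, docCount);
            // store it, keyed by docs.docID()
        }
    }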
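
To make the "one array per term, one entry per selected document" result concrete, here is a minimal, Lucene-free sketch. It is not the Lucene API: the class SelectedDocsTfIdf and its methods are hypothetical names used only for illustration, documents are modelled as plain token lists, and tf/idf are re-implemented with the formulas DefaultSimilarity uses (tf = sqrt(freq), idf = 1 + ln(numDocs / (docFreq + 1))). It also assumes you want document frequencies from the whole index while term frequencies are read only for the selected documents, which is one reasonable answer to problem 1 in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper class, not part of Lucene.
public class SelectedDocsTfIdf {

    // Same formula as DefaultSimilarity.tf: square root of the in-document frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Same formula as DefaultSimilarity.idf: note the (docFreq, numDocs) argument order.
    static float idf(long docFreq, long numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    // Returns term -> array with one TF-IDF value per selected document
    // (0 where the term does not occur in that document).
    static Map<String, float[]> tfidf(List<List<String>> allDocs, int[] selected) {
        // Document frequency per term, counted over the whole collection ("index").
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : allDocs) {
            for (String t : new HashSet<>(doc)) {
                docFreq.merge(t, 1, Integer::sum);
            }
        }

        Map<String, float[]> result = new TreeMap<>();
        for (int i = 0; i < selected.length; i++) {
            // Term frequency inside the i-th selected document only.
            Map<String, Integer> freq = new HashMap<>();
            for (String t : allDocs.get(selected[i])) {
                freq.merge(t, 1, Integer::sum);
            }
            for (Map.Entry<String, Integer> e : freq.entrySet()) {
                int f = e.getValue();
                int df = docFreq.get(e.getKey());
                float[] row = result.computeIfAbsent(e.getKey(), k -> new float[selected.length]);
                row[i] = tf(f) * idf(df, allDocs.size());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> docs = new ArrayList<>();
        docs.add(Arrays.asList("a", "b", "a"));
        docs.add(Arrays.asList("b", "c"));
        docs.add(Arrays.asList("c"));

        // Select documents 0 and 1; document 2 still contributes to docFreq.
        Map<String, float[]> scores = tfidf(docs, new int[] {0, 1});
        for (Map.Entry<String, float[]> e : scores.entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

In the Lucene loop above, the equivalent step would be to check docs.docID() against your set of selected IDs and write the value into the matching array slot, skipping all other documents.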