Tags: java, lucene

Lucene: TF-IDF of each token


I'm studying Lucene 8; this is my first time working with Lucene.

I want to compute the TF-IDF of each term, in order to obtain the top 10,000 tokens in my Lucene Directory. I've tried several approaches but I'm stuck and don't know how to proceed. This is an example of what I did:

private static void getTokensForField(IndexReader reader, String fieldName) throws IOException {

        List<LeafReaderContext> list = reader.leaves();
        Similarity similarity = new ClassicSimilarity();

        int docnum = reader.numDocs();

        for (LeafReaderContext lrc : list) {
            Terms terms = lrc.reader().terms(fieldName);
            if (terms != null) {
                TermsEnum termsEnum = terms.iterator();

                BytesRef term;
                while ((term = termsEnum.next()) != null) {
                    // cast to double to avoid integer division
                    double tf = (double) termsEnum.totalTermFreq() / terms.size();
                    double idf = Math.log((double) docnum / termsEnum.docFreq());
                    // System.out.println(term.utf8ToString() + "\tTF: " + tf + "\tIDF: " + idf);
                }
            }
        }
    }

I'm currently studying this topic, but the resources I've found have not been very useful.

I've also searched online, but everything I've found uses deprecated APIs.

Do you have any suggestions?


Solution

  • The simplest way I know to access statistics such as TF and IDF is to use the Explanation class.

    However, just to clarify (apologies if I am telling you what you already know): the term frequency value is for a term within a document, so the same term may have different values across different documents.

    I'm not really sure what that means for your wish to "obtain the top 10,000 tokens in my Lucene Directory". Perhaps you will need to calculate the TF for every term in every document, and then pick the "best" value for that term, for your needs?
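
    For example, a per-document iteration over the postings could look like the sketch below. This is just an illustrative helper (not part of the Explanation approach); the classes come from org.apache.lucene.index, org.apache.lucene.search and org.apache.lucene.util:

    private static void perDocumentTermFrequencies(IndexReader reader, String fieldName) throws IOException {
        for (LeafReaderContext context : reader.leaves()) {
            Terms terms = context.reader().terms(fieldName);
            if (terms == null) {
                continue;
            }
            TermsEnum termsEnum = terms.iterator();
            PostingsEnum postings = null;
            BytesRef term;
            while ((term = termsEnum.next()) != null) {
                int docFreq = termsEnum.docFreq(); // documents (in this segment) containing the term
                postings = termsEnum.postings(postings, PostingsEnum.FREQS);
                int doc;
                while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
                    int freqInDoc = postings.freq(); // term frequency within this particular document
                    // aggregate per term here (e.g. keep the maximum, or sum across documents)
                }
            }
        }
    }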

    Here is a simple example of building an Explanation:

    private static void getExplanation(IndexSearcher searcher, Query query, int docID) throws IOException {
        Explanation explanation = searcher.explain(query, docID);
        //explanation.getDescription(); // do what you need with this data
        //explanation.getDetails();     // do what you need with this data
    }
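
    If you would rather read the numbers programmatically than as text, you can walk the nested details recursively; getValue() returns the numeric value of each component and getDescription() its human-readable label. The helper name below is just illustrative:

    private static void printScoreComponents(Explanation explanation, String indent) {
        // each node of the explanation tree carries a value and a description
        System.out.println(indent + explanation.getValue() + "  " + explanation.getDescription());
        for (Explanation detail : explanation.getDetails()) {
            printScoreComponents(detail, indent + "  "); // recurse into sub-explanations (idf, tf, ...)
        }
    }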
    

    So, you might call getExplanation as you iterate over the hits for a query:

    private static void printHits(Query query) throws IOException {
        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(INDEX_PATH)));
    
        IndexSearcher searcher = new IndexSearcher(reader);
        TopDocs results = searcher.search(query, 100); // or whatever you need instead of 100
        ScoreDoc[] hits = results.scoreDocs;
        for (ScoreDoc hit : hits) {
            getExplanation(searcher, query, hit.doc);
        }
    }
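
    To explain a single token, the query can simply be a TermQuery for that term (the field and term below are just the ones from the sample output further down):

    printHits(new TermQuery(new Term("body", "war")));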
    

    The information provided by explanation.getDetails() is basically the same as the information you would see if you were to use Luke to analyze a query:

    (screenshot of the Luke query explanation view)

    As text:

    0.14566182 weight(body:war in 3) [BM25Similarity], result of:
      0.14566182 score(freq=1.0), computed as boost * idf * tf from:
        0.2876821 idf, computed as log(1 + (N - n + 0.5) / (n + 0.5)) from:
          4 n, number of documents containing term
          5 N, total number of documents with field
        0.50632906 tf, computed as freq / (freq + k1 * (1 - b + b * dl / avgdl)) from:
          1.0 freq, occurrences of term within document
          1.2 k1, term saturation parameter
          0.75 b, length normalization parameter
          3.0 dl, length of field
          4.0 avgdl, average length of field
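
    One last note: the explanation above was produced with BM25Similarity, which has been Lucene's default since version 6.0. Since your own code uses ClassicSimilarity, you can set that on the searcher (and, for consistent scoring, at index time via IndexWriterConfig.setSimilarity) so the explanation reports the classic TF-IDF factors instead:

    IndexSearcher searcher = new IndexSearcher(reader);
    searcher.setSimilarity(new ClassicSimilarity()); // report classic TF-IDF instead of BM25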