Tags: java, lucene, information-retrieval, language-model

Accessing term statistics in Lucene 4


I have a Lucene index and need to access some statistics, such as a term's collection frequency. The BasicStats class holds this information, but I could not work out whether the class is accessible from my own code.

Is it possible to access the BasicStats class in Lucene 4?


Solution

  • BasicStats on its own won't do much for you. All it does is hold values; it has no logic for acquiring that information.

    BasicStats is intended to be used by a Similarity implementation, which gathers all the information to put into it. The code that does this in SimilarityBase is protected, but we can reuse its logic. To populate a BasicStats you also need a CollectionStatistics and a TermStatistics, and all you need to build those is the Term you are interested in and an IndexReader:

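    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.CollectionStatistics;
    import org.apache.lucene.search.TermStatistics;
    import org.apache.lucene.search.similarities.BasicStats;
    
    public static BasicStats getBasicStats(IndexReader indexReader, Term myTerm, float queryBoost) throws IOException {
        String fieldName = myTerm.field();
    
        // Field-level statistics; pass the term's own field name rather than a hard-coded literal
        CollectionStatistics collectionStats = new CollectionStatistics(
                fieldName,
                indexReader.maxDoc(),
                indexReader.getDocCount(fieldName),
                indexReader.getSumTotalTermFreq(fieldName),
                indexReader.getSumDocFreq(fieldName)
                );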
    
        TermStatistics termStats = new TermStatistics(
                myTerm.bytes(),
                indexReader.docFreq(myTerm),
                indexReader.totalTermFreq(myTerm)
                );
    
        BasicStats myStats = new BasicStats(fieldName, queryBoost);
        // The field must contain at least as many term occurrences as the single term does
        assert collectionStats.sumTotalTermFreq() == -1 || collectionStats.sumTotalTermFreq() >= termStats.totalTermFreq();
        long numberOfDocuments = collectionStats.maxDoc();
    
        long docFreq = termStats.docFreq();
        long totalTermFreq = termStats.totalTermFreq();
    
        // -1 means the index does not supply totalTermFreq: substitute docFreq
        if (totalTermFreq == -1) {
          totalTermFreq = docFreq;
        }
    
        final long numberOfFieldTokens;
        final float avgFieldLength;
    
        long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
    
        if (sumTotalTermFreq <= 0) {
            // Field statistics are missing or empty: fall back to values that stay positive
            numberOfFieldTokens = docFreq;
            avgFieldLength = 1;
        } else {
            numberOfFieldTokens = sumTotalTermFreq;
            avgFieldLength = (float)numberOfFieldTokens / numberOfDocuments;
        }
    
        myStats.setNumberOfDocuments(numberOfDocuments);
        myStats.setNumberOfFieldTokens(numberOfFieldTokens);
        myStats.setAvgFieldLength(avgFieldLength);
        myStats.setDocFreq(docFreq);
        myStats.setTotalTermFreq(totalTermFreq);
    
        return myStats;
    }
    

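    If all you are after is one or two specific figures, this is probably overkill, since each one is a single call on IndexReader: for example, indexReader.totalTermFreq(myTerm) for the term's collection frequency, or indexReader.docFreq(myTerm) for its document frequency. But there it is.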
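
As a rough illustration of the fallback arithmetic in getBasicStats above, the following library-free sketch applies the same rules to plain numbers. It does not use Lucene at all: the class StatsFallbackDemo, the Derived holder, the derive method, and the sample figures are hypothetical names and values chosen only for illustration.

```java
// Hypothetical, Lucene-free illustration of the fallback rules used in getBasicStats.
class StatsFallbackDemo {

    /** Plain holder for the derived values; an illustrative stand-in, not BasicStats itself. */
    static final class Derived {
        final long totalTermFreq;
        final long numberOfFieldTokens;
        final float avgFieldLength;

        Derived(long totalTermFreq, long numberOfFieldTokens, float avgFieldLength) {
            this.totalTermFreq = totalTermFreq;
            this.numberOfFieldTokens = numberOfFieldTokens;
            this.avgFieldLength = avgFieldLength;
        }
    }

    /**
     * Applies the same substitutions as getBasicStats:
     * a totalTermFreq of -1 is replaced by docFreq, and a non-positive
     * sumTotalTermFreq falls back to docFreq tokens with an average length of 1.
     */
    static Derived derive(long maxDoc, long docFreq, long totalTermFreq, long sumTotalTermFreq) {
        long effectiveTotalTermFreq = (totalTermFreq == -1) ? docFreq : totalTermFreq;

        if (sumTotalTermFreq <= 0) {
            return new Derived(effectiveTotalTermFreq, docFreq, 1f);
        }
        float avgFieldLength = (float) sumTotalTermFreq / maxDoc;
        return new Derived(effectiveTotalTermFreq, sumTotalTermFreq, avgFieldLength);
    }

    public static void main(String[] args) {
        // Made-up figures: 1000 documents, the term occurs 120 times in 40 of them,
        // and the field holds 50000 term occurrences in total.
        Derived full = derive(1000, 40, 120, 50000);
        System.out.println(full.totalTermFreq + " " + full.numberOfFieldTokens + " " + full.avgFieldLength);

        // Same index, but with frequency statistics unavailable (-1).
        Derived missing = derive(1000, 40, -1, -1);
        System.out.println(missing.totalTermFreq + " " + missing.numberOfFieldTokens + " " + missing.avgFieldLength);
    }
}
```

With the full statistics the average field length is simply the field's total term count divided by maxDoc; when the statistics are unavailable, every derived value stays positive, which is the point of the substitutions.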