I have a Lucene index, and I need to access some statistics such as term collection frequency. The BasicStats class holds this information, but I could not work out whether the class is accessible. Is it possible to access the BasicStats class in Lucene 4?
BasicStats on its own won't do much for you. About all it does is hold values; it doesn't have any of the intelligence needed to acquire that information.

BasicStats is intended to be used by the Similarity implementation, which generates all the information to put into it. The method SimilarityBase uses to do this (fillBasicStats) is protected, but we can make use of the code there. To populate the BasicStats, you'll also need a CollectionStatistics and a TermStatistics, but really all you'll need to get those is the Term you are interested in and an IndexReader:
    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.CollectionStatistics;
    import org.apache.lucene.search.TermStatistics;
    import org.apache.lucene.search.similarities.BasicStats;

    public static BasicStats getBasicStats(IndexReader indexReader, Term myTerm, float queryBoost) throws IOException {
        String fieldName = myTerm.field();

        // Field-level statistics, read straight from the index.
        CollectionStatistics collectionStats = new CollectionStatistics(
            fieldName,
            indexReader.maxDoc(),
            indexReader.getDocCount(fieldName),
            indexReader.getSumTotalTermFreq(fieldName),
            indexReader.getSumDocFreq(fieldName)
        );

        // Term-level statistics for the term of interest.
        TermStatistics termStats = new TermStatistics(
            myTerm.bytes(),
            indexReader.docFreq(myTerm),
            indexReader.totalTermFreq(myTerm)
        );

        BasicStats myStats = new BasicStats(fieldName, queryBoost);

        // The field must contain at least as many term occurrences as the term itself has.
        assert collectionStats.sumTotalTermFreq() == -1 || collectionStats.sumTotalTermFreq() >= termStats.totalTermFreq();
        long numberOfDocuments = collectionStats.maxDoc();
        long docFreq = termStats.docFreq();
        long totalTermFreq = termStats.totalTermFreq();
        // -1 means the codec does not supply totalTermFreq, so substitute docFreq.
        if (totalTermFreq == -1) {
            totalTermFreq = docFreq;
        }

        final long numberOfFieldTokens;
        final float avgFieldLength;
        long sumTotalTermFreq = collectionStats.sumTotalTermFreq();
        if (sumTotalTermFreq <= 0) {
            // Field is missing or frequencies are unavailable: fall back to safe values.
            numberOfFieldTokens = docFreq;
            avgFieldLength = 1;
        } else {
            numberOfFieldTokens = sumTotalTermFreq;
            avgFieldLength = (float) numberOfFieldTokens / numberOfDocuments;
        }

        myStats.setNumberOfDocuments(numberOfDocuments);
        myStats.setNumberOfFieldTokens(numberOfFieldTokens);
        myStats.setAvgFieldLength(avgFieldLength);
        myStats.setDocFreq(docFreq);
        myStats.setTotalTermFreq(totalTermFreq);
        return myStats;
    }
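If all you are after is one or two specific figures, this is probably overkill, since a call or two to IndexReader will get them directly (for example, indexReader.totalTermFreq(myTerm) for a term's collection frequency). But there it is.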
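To see what the fallback branches in the method above do with concrete numbers, here is a minimal sketch in plain Java with no Lucene dependency. The class StatsFallbackDemo and its method names are hypothetical, invented for illustration only, and the figures in main are made up; the rules themselves are the ones from the code above (substitute docFreq when totalTermFreq is -1, and use docFreq and an average length of 1 when sumTotalTermFreq is not positive).

```java
// Hypothetical, Lucene-free helper that mirrors the fallback rules in getBasicStats above.
final class StatsFallbackDemo {

    // If total term frequency is unavailable (-1), fall back to docFreq.
    static long effectiveTotalTermFreq(long totalTermFreq, long docFreq) {
        return totalTermFreq == -1 ? docFreq : totalTermFreq;
    }

    // If the field has no usable token count (<= 0), fall back to docFreq.
    static long numberOfFieldTokens(long sumTotalTermFreq, long docFreq) {
        return sumTotalTermFreq <= 0 ? docFreq : sumTotalTermFreq;
    }

    // Tokens per document, or 1 when the token count is unavailable.
    static float avgFieldLength(long sumTotalTermFreq, long numberOfDocuments) {
        return sumTotalTermFreq <= 0 ? 1f : (float) sumTotalTermFreq / numberOfDocuments;
    }

    public static void main(String[] args) {
        // Made-up figures: 4 documents, 1000 tokens in the field, term found in 3 documents.
        System.out.println(effectiveTotalTermFreq(-1, 3));
        System.out.println(numberOfFieldTokens(1000, 3));
        System.out.println(avgFieldLength(1000, 4));
        System.out.println(avgFieldLength(-1, 4));
    }
}
```

The point of the fallbacks is that a -1 or 0 from the index never reaches the scoring formulas, where it could produce NaN or infinite values.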