java, lucene, document-classification

Using multiple Leaves in Lucene Classifiers


I am trying to use the KNearestNeighbor document classifier in Lucene. The classifier accepts a LeafReader in its constructor, which it uses for training. The problem is that the index I am using to train the classifier has multiple leaves, but the constructor only accepts one, and I could not find a way to add the remaining LeafReaders to the classifier. I might be missing something. Could anyone please help me out with this?

Here is the code I am currently using:

    FSDirectory index = FSDirectory.open(Paths.get(indexLoc));
    IndexReader reader = DirectoryReader.open(index);
    // One LeafReaderContext per index segment
    List<LeafReaderContext> leaves = reader.leaves();
    // Only the first leaf is used here; the remaining segments are ignored
    LeafReaderContext leaf = leaves.get(0);
    LeafReader atomicReader = leaf.reader();
    KNearestNeighborDocumentClassifier knn = new KNearestNeighborDocumentClassifier(atomicReader, BM25, null, 10, 0, 0, "Topics", field2analyzer, "Text");

Solution

  • Each leaf corresponds to one segment of your index. For the best performance and resource usage, you should iterate over the leaves, run the classification for each segment, and accumulate the results.

    for (LeafReaderContext context : indexReader.getContext().leaves()) {
      LeafReader reader = context.reader();
      // run for each leaf
    }
    

    If that is not possible, you can use the SlowCompositeReaderWrapper which, as the name suggests, might be very slow as it aggregates all the leaves on the fly.

    LeafReader singleLeaf = SlowCompositeReaderWrapper.wrap(indexReader);
    // run classifier on singleLeaf
    

    Depending on your Lucene version, this class sits in lucene-core or in lucene-misc (the latter since Lucene 6.0, I think). Note also that it is deprecated and scheduled for removal in Lucene 7.0.

    The third option might be to run forceMerge(1) so that you are left with only one segment and can use that single leaf. However, forcing a merge down to a single segment has other issues and might not work for your use case. If your data is write-once and then only read, you could use a forceMerge. If you have regular updates, you'll end up having to use the first option and aggregate the classification results yourself.
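
The "accumulate your results" step of the first option could look roughly like the sketch below. It is only an illustration under stated assumptions: `SegmentScoreMerger` and `bestClass` are hypothetical names and not part of Lucene, and each map is assumed to hold the class labels and scores you collected from one leaf (for example from the `ClassificationResult`s of a classifier built on that leaf). Summing scores across segments is just one simple way to combine them. Scores computed against different segments may not be directly comparable, so you may prefer a different rule, such as counting one vote per segment.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene: combines per-segment class scores.
final class SegmentScoreMerger {

    private SegmentScoreMerger() {}

    /**
     * Sums the score each segment assigned to each class label and returns
     * the label with the highest total, or null if no scores were supplied.
     * Assumption: scores from different segments are treated as comparable.
     */
    static String bestClass(List<Map<String, Double>> perSegmentScores) {
        Map<String, Double> totals = new HashMap<>();
        for (Map<String, Double> segmentScores : perSegmentScores) {
            for (Map.Entry<String, Double> e : segmentScores.entrySet()) {
                // Add this segment's score to the running total for the label
                totals.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (Map.Entry<String, Double> e : totals.entrySet()) {
            if (e.getValue() > bestScore) {
                best = e.getKey();
                bestScore = e.getValue();
            }
        }
        return best;
    }
}
```

You would fill one map per leaf inside the loop over `leaves()` and call `bestClass` once all segments have been processed.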