How to merge Lucene index files?


I am integrating Apache Lucene into a Spring Boot application (this is my first experience with it) and everything works, but I see a bunch of index files: .cfs, .si, .cfe. How do I combine them, and is it necessary to do so if I plan to reach 1 billion files in the index?

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>9.8.0</version>
</dependency>

To add new data to the index, I wrote the following simple method:

public synchronized void addToIndex(IndexData data) {
    Document doc = setDocument(data.id, data.body, data.coutry);
    try {
        writer.addDocument(doc);
        writer.commit();
        writer.maybeMerge();
        writer.flush();
        doc.clear();
    } catch (IOException e) {
        e.printStackTrace();
    }
}

This method is located in a singleton class that holds the IndexWriter configuration: config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND); Is calling "maybeMerge()" enough for Lucene to merge the files itself when needed?


Solution

  • Bottom line:

    If you are not facing a specific problem, then there is probably nothing you need to change about how Lucene automatically manages segment merges.


    More notes:

    Yes, a Lucene index directory will contain "a bunch of files" - see Apache Lucene - Index File Formats for an overview.

    Groups of related files form segments, where:

    Each segment is a fully independent index, which could be searched separately.
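
    To make the file-to-segment relationship concrete, here is a minimal sketch that groups file names such as _0.cfs, _0.si and _0.cfe by their segment. It uses only the JDK and assumes the naming convention described in the file formats documentation: files belonging to one segment share the same name prefix (for example _0), while index-wide files such as segments_N and write.lock do not start with an underscore. The class and method names (SegmentFileGrouper, segmentNameOf, groupBySegment) are hypothetical and exist only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative helper (not part of Lucene): groups index file names by segment.
final class SegmentFileGrouper {

    /** Returns the segment prefix of a file name, or null for index-wide files. */
    static String segmentNameOf(String fileName) {
        if (!fileName.startsWith("_")) {
            return null; // e.g. segments_N or write.lock
        }
        // Assumption: the segment name ends at the first '.' or at the
        // second '_' (the first '_' is the leading character).
        int end = fileName.length();
        int dot = fileName.indexOf('.');
        int underscore = fileName.indexOf('_', 1);
        if (dot >= 0) {
            end = Math.min(end, dot);
        }
        if (underscore >= 0) {
            end = Math.min(end, underscore);
        }
        return fileName.substring(0, end);
    }

    /** Maps each segment name to the files that belong to it, sorted by segment name. */
    static Map<String, List<String>> groupBySegment(List<String> fileNames) {
        Map<String, List<String>> groups = new TreeMap<>();
        for (String name : fileNames) {
            String segment = segmentNameOf(name);
            if (segment != null) {
                groups.computeIfAbsent(segment, k -> new ArrayList<>()).add(name);
            }
        }
        return groups;
    }
}
```

    Passing the names from a directory listing (for instance, collected with java.nio.file.Files.list) to groupBySegment shows how many segments the index currently has. A count that rises and falls while you are indexing is the automatic merging at work.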

    Segments (and their related files) are automatically created and merged by Lucene, as it deems necessary/appropriate, as documents are added to (and removed from) the index. You do not need to take any specific action, unless you are facing a specific situation where a manually triggered merge may be beneficial.

    There is a performance cost associated with Lucene needing to search across multiple segments; conversely, there is a performance cost associated with performing a merge. My advice: You should assume Lucene knows best, and leave it to manage its segments itself, unless you are certain you have a good reason to do otherwise.

    For example, see the JavaDoc for forceMerge(), where it states:

    This is a horribly costly operation, especially when you pass a small maxNumSegments; usually you should only call this if the index is static (will no longer be changed).
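
    For the static-index case that the Javadoc mentions, a one-time merge after the last write might look like the sketch below. It assumes lucene-core 9.x on the classpath. The class name StaticIndexCompactor, the single "id" field and the commit-every-10-documents loop are made up for illustration (the loop only exists to produce several small segments).

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

// Illustrative sketch: build an index once, then compact it to one segment.
final class StaticIndexCompactor {

    /** Indexes docCount dummy documents, merges once, and returns the segment count. */
    static int buildAndCompact(Directory dir, int docCount) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        try (IndexWriter writer = new IndexWriter(dir, config)) {
            for (int i = 0; i < docCount; i++) {
                Document doc = new Document();
                doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
                writer.addDocument(doc);
                if (i % 10 == 9) {
                    writer.commit(); // each commit flushes a new small segment
                }
            }
            // Costly; only sensible here because no further writes will follow.
            writer.forceMerge(1);
        } // closing the writer commits the merged state
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            return reader.leaves().size(); // one leaf per segment
        }
    }
}
```

    On an index that keeps receiving documents, the same call would have to be repeated again and again, which is exactly the cost the Javadoc warns about.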

    For maybeMerge(), I'd give the same advice as above: leave it to Lucene, unless you have a very specific reason/problem to intervene. I would absolutely not want to call writer.maybeMerge(); a billion times, on the off-chance that a merge may happen a few of those times.
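
    As a rough illustration of "leave it to Lucene", the method from the question could be reduced to adding documents and committing once per batch, with no explicit maybeMerge() or flush() calls. This is a sketch under assumptions, not a drop-in replacement: the class name BatchIndexer, the String[] row format and the field types chosen for "id" and "body" are hypothetical, since the question does not show setDocument or IndexData.

```java
import java.io.IOException;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

// Illustrative sketch: one long-lived IndexWriter, one commit per batch.
final class BatchIndexer implements AutoCloseable {
    private final IndexWriter writer;

    BatchIndexer(Directory dir) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
        config.setOpenMode(IndexWriterConfig.OpenMode.CREATE_OR_APPEND);
        // The default merge policy and scheduler are left untouched on purpose.
        this.writer = new IndexWriter(dir, config);
    }

    /** Adds all rows ({id, body}) and commits once; merges are left to Lucene. */
    void addBatch(List<String[]> rows) throws IOException {
        for (String[] row : rows) {
            Document doc = new Document();
            doc.add(new StringField("id", row[0], Field.Store.YES));
            doc.add(new TextField("body", row[1], Field.Store.NO));
            writer.addDocument(doc);
        }
        writer.commit(); // makes the batch durable and visible to new readers
    }

    @Override
    public void close() throws IOException {
        writer.close();
    }

    /** Opens a fresh reader and returns the number of live documents. */
    static int countDocs(Directory dir) throws IOException {
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            return reader.numDocs();
        }
    }
}
```

    IndexWriter is documented as thread-safe, so the synchronized keyword from the question is not needed just to protect addDocument. How often to commit is a trade-off between durability and throughput that depends on the application.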