Lucene index files change constantly even when no add, update, or delete operations are performed on the index

I have noticed that my Lucene index segment files (the file names) keep changing, even when I am not performing any add, update, or delete operations. The only operations I perform are reading and searching. So my question is: do Lucene index segment files get updated internally just from reading and searching?

I am using Lucene.Net v4.8 beta, if that matters. Thanks!

Here is an example of how I found this issue (I wanted to get the index size). Assuming a Lucene Index already exists, I used the following code to get the index size:

Example:

private long GetIndexSize()
{
    var reader = GetDirectoryReader("validPath");
    long size = 0;

    foreach (var fileName in reader.Directory.ListAll())
    {
        size += reader.Directory.FileLength(fileName);
    }

    return size;
}

private DirectoryReader GetDirectoryReader(string path)
{
    var directory = FSDirectory.Open(path);
    var reader = DirectoryReader.Open(directory);
    return reader;
}

The above method is called every 5 minutes. It works fine ~98% of the time. The other 2% of the time, however, I get a file-not-found error inside the foreach loop, and while debugging I saw that the number of files in reader.Directory was changing. The index is updated at certain times by another service, but I can assure you that no updates were made to the index anywhere near the times when this error occurs.

Solution

Since you have multiple processes writing/reading the same set of files, it is difficult to isolate what is happening. Lucene.NET does locking and exception handling to ensure operations can be synced up between processes, but if you read the files in the directory directly without doing any locking, you need to be prepared to deal with IOExceptions.

The solution depends on how up to date you need the index size to be:

If it is okay to be a bit out of date, I would suggest using DirectoryInfo.EnumerateFiles on the directory itself. This may be a bit more up to date than Directory.ListAll() because that method stores the file names in an array, which may go stale before the loop is done. But, you still need to catch FileNotFoundException and ignore it and possibly deal with other IOExceptions.
If you need the size to be absolutely up to date and plan to do an operation that requires the index to be that size, you need to open a write lock to prevent the files from changing while you get the value.

private long GetIndexSize()
{
    // DirectoryReader is superfluous for this example. Also,
    // using a MMapDirectory (which FSDirectory.Open() may return)
    // will use more RAM than simply using SimpleFSDirectory.
    var directory = new SimpleFSDirectory("validPath");
    long size = 0;

    // NOTE: The lock will stay active until this is disposed,
    // so if you have any follow-on actions to perform, the lock
    // should be obtained before calling this method and disposed
    // after you have completed all of your operations.
    using Lock writeLock = directory.MakeLock(IndexWriter.WRITE_LOCK_NAME);

    // Obtain exclusive write access to the directory
    if (!writeLock.Obtain(/* optional timeout */))
    {
         // timeout failed, either throw an exception or retry...
    }

    foreach (var fileName in directory.ListAll())
    {
        size += directory.FileLength(fileName);
    }

    return size;
}

Of course, if you go that route, your IndexWriter may throw a LockObtainFailedException while the lock is held, and you should be prepared to handle it during the write process.

However you deal with it, you need to be catching and handling exceptions because IO by its nature has many things that can go wrong. But exactly how you deal with it depends on what your priorities are.
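As a rough illustration of the first option (a size that may be slightly out of date), here is a minimal sketch that uses only System.IO and does not touch Lucene.NET at all. The names IndexSizeEstimator and GetApproximateIndexSize are hypothetical, and the decision to silently skip files that vanish is an assumption about what "okay to be a bit out of date" means for your application.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class IndexSizeEstimator
{
    // Sums the sizes of the files directly inside indexPath.
    // Files that disappear while we are looking at them are skipped,
    // so the result is an approximation, not an exact snapshot.
    public static long GetApproximateIndexSize(string indexPath)
    {
        long size = 0;
        var directoryInfo = new DirectoryInfo(indexPath);

        // EnumerateFiles is lazy, so names are not all captured up front.
        foreach (FileInfo file in directoryInfo.EnumerateFiles())
        {
            try
            {
                size += file.Length;
            }
            catch (FileNotFoundException)
            {
                // The file was removed (for example by a segment merge
                // in another process) after it was enumerated. Ignore it.
            }
        }

        return size;
    }
}
```

Other IOExceptions (for example, the directory itself being missing) are deliberately left uncaught here so the caller can decide whether to retry or report them.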

Original Answer

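If you have an IndexWriter instance open, Lucene.NET will merge segments on background threads based on the MergePolicy being used. A merge writes new segment files and deletes the ones they replace, which is why the file names in the directory can change even when no documents are being added. The default settings can be used with most applications.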

However, the settings are configurable through the IndexWriterConfig.MergePolicy property. By default, it uses the TieredMergePolicy.

var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    MergePolicy = new TieredMergePolicy()
};

There are several properties on TieredMergePolicy that can be used to change the thresholds that it uses to merge.
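For instance, a configuration along these lines adjusts a few of those thresholds. This is only a sketch: the values are arbitrary placeholders rather than recommendations, StandardAnalyzer is assumed here just to make the snippet complete, and it requires the Lucene.Net and Lucene.Net.Analysis.Common packages.

```csharp
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Util;

// Assumed analyzer; use whichever analyzer your index was built with.
var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);

var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    MergePolicy = new TieredMergePolicy
    {
        // Maximum number of segments merged in a single merge.
        MaxMergeAtOnce = 5,
        // Segments allowed per tier before a merge is triggered;
        // larger values mean less frequent merging.
        SegmentsPerTier = 20,
        // Upper bound (in MB) on the size of a merged segment.
        MaxMergedSegmentMB = 1024
    }
};
```

Raising SegmentsPerTier generally trades fewer merges (and so fewer file changes) for more segments to search.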

Or, it can be changed to a different MergePolicy implementation. Lucene.NET comes with several, including LogByteSizeMergePolicy, LogDocMergePolicy, TieredMergePolicy, and NoMergePolicy.

The NoMergePolicy class can be used to disable merging entirely.

If your application never needs to add documents to the index (for example, if the index is built as part of the application deployment), it is also possible to open an IndexReader on a Directory instance directly, without any IndexWriter. A reader on its own does not do any background segment merges.

The merge scheduler can also be swapped and/or configured using the IndexWriterConfig.MergeScheduler property. By default, it uses the ConcurrentMergeScheduler.

var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
    MergePolicy = new TieredMergePolicy(),
    MergeScheduler = new ConcurrentMergeScheduler()
};

The merge schedulers included with Lucene.NET 4.8.0 include ConcurrentMergeScheduler, SerialMergeScheduler, and NoMergeScheduler.

The NoMergeScheduler class can be used to disable merging entirely. This has the same effect as using NoMergePolicy, but also prevents any scheduling code from being executed.