I have noticed that, my lucene index segment files (file names) are always changing constantly, even when I am not performing any add, update, or delete operations. The only operations I am performing is reading and searching. So, my question is, does Lucene index segment files get updated internally somehow just from reading and searching operations?
I am using Lucene.Net v4.8 beta, if that matters. Thanks!
Here is an example of how I found this issue (I wanted to get the index size). Assuming a Lucene Index already exists, I used the following code to get the index size:
Example:
private long GetIndexSize()
{
var reader = GetDirectoryReader("validPath");
long size = 0;
foreach (var fileName in reader.Directory.ListAll())
{
size += reader.Directory.FileLength(fileName);
}
return size;
}
private DirectoryReader GetDirectoryReader(string path)
{
var directory = FSDirectory.Open(path);
var reader = DirectoryReader.Open(directory);
return reader;
}
The above method is called every 5 minutes. It works fine ~98% of the time. However, the other 2% of the time, I would get the error file not found
in the foreach
loop, and after debugging, I saw that the files in reader.Directory
are changing in count. The index is updated at certain times by another service, but I can assure that no updates were made to the index anywhere near the times when this error occurs.
Since you have multiple processes writing/reading the same set of files, it is difficult to isolate what is happening. Lucene.NET does locking and exception handling to ensure operations can be synced up between processes, but if you read the files in the directory directly without doing any locking, you need to be prepared to deal with IOException
s.
The solution depends on how up to date you need the index size to be:
Directory.ListAll()
because that method stores the file names in an array, which may go stale before the loop is done. But, you still need to catch FileNotFoundException
and ignore it and possibly deal with other IOException
s.private long GetIndexSize()
{
// DirectoryReader is superfluous for this example. Also,
// using a MMapDirectory (which DirectoryReader.Open() may return)
// will use more RAM than simply using SimpleFSDirectory.
var directory = new SimpleFSDirectory("validPath");
long size = 0;
// NOTE: The lock will stay active until this is disposed,
// so if you have any follow-on actions to perform, the lock
// should be obtained before calling this method and disposed
// after you have completed all of your operations.
using Lock writeLock = directory.MakeLock(IndexWriter.WRITE_LOCK_NAME);
// Obtain exclusive write access to the directory
if (!writeLock.Obtain(/* optional timeout */))
{
// timeout failed, either throw an exception or retry...
}
foreach (var fileName in directory.ListAll())
{
size += directory.FileLength(fileName);
}
return size;
}
Of course, if you go that route, your IndexWriter
may throw a LockObtainFailedException
and you should be prepared to handle them during the write process.
However you deal with it, you need to be catching and handling exceptions because IO by its nature has many things that can go wrong. But exactly how you deal with it depends on what your priorities are.
If you have an IndexWriter
instance open, Lucene.NET will run a background process to merge segments based on the MergePolicy
being used. The default settings can be used with most applications.
However, the settings are configurable through the IndexWriterConfig.MergePolicy
property. By default, it uses the TieredMergePolicy
.
var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
MergePolicy = new TieredMergePolicy()
};
There are several properties on TieredMergePolicy
that can be used to change the thresholds that it uses to merge.
Or, it can be changed to a different MergePolicy
implementation. Lucene.NET comes with:
The NoMergePolicy
class can be used to disable merging entirely.
If your application never needs to add documents to the index (for example, if the index is built as part of the application deployment), it is also possible to use a
IndexReader
from aDirectory
instance directly, which does not do any background segment merges.
The merge scheduler can also be swapped and/or configured using the IndexWriterConfig.MergeScheduler
property. By default, it uses the ConcurrentMergeScheduler
.
var config = new IndexWriterConfig(LuceneVersion.LUCENE_48, analyzer)
{
MergePolicy = new TieredMergePolicy(),
MergeScheduler = new ConcurrentMergeScheduler()
};
The merge schedulers that are included with Lucene.NET 4.8.0 are:
The NoMergeScheduler
class can be used to disable merging entirely. This has the same effect as using NoMergePolicy
, but also prevents any scheduling code from being executed.