Lucene.NET returning the wrong document in queries after document is updated


Search works as expected until I update a document in the index. After that, the updated document no longer comes back in searches; instead, a completely unrelated document with docId=0 is returned.

This is how I set it up:

var luceneVersion = LuceneVersion.LUCENE_48;
var analyzersPerField = new Dictionary<string, Analyzer>
{
    ["name"] = new KeywordAnalyzer()
};
var __analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(luceneVersion), analyzersPerField);
var __luceneDirectory = FSDirectory.Open(_searchDirectory);
var indexConfig = new IndexWriterConfig(luceneVersion, __analyzer)
{
    OpenMode = OpenMode.CREATE_OR_APPEND,
};
var _writer = new IndexWriter(__luceneDirectory, indexConfig);
var __directoryReader = DirectoryReader.Open(__luceneDirectory);
var _searcher = new IndexSearcher(__directoryReader);
var _queryParser = new StandardQueryParser(__analyzer);

Here is how documents are defined:

private static Document CreateDocument(FileResult fileResult)
{
    var document = new Document
    {
        new StringField("id", fileResult.Id, Field.Store.YES)
    };
    document.Add(new TextField("baseKeywords", fileResult.AlternativeName, Field.Store.NO));
    document.Add(new StringField("name", fileResult.AlternativeName, Field.Store.YES));
    return document;
}

Here is how documents are updated:

public void UpdateDocumentName(FileResult fileResult, string newName)
{
    fileResult.AlternativeName = newName;
    var document = CreateDocument(fileResult);
    _writer.UpdateDocument(new Term("id", fileResult.Id), document);
}

After an update is done, I do a commit and create the new reader:

_writer.Commit();
__directoryReader = DirectoryReader.OpenIfChanged(__directoryReader) ?? __directoryReader;
_searcher = new IndexSearcher(__directoryReader);

How documents are searched:

_queryString = QueryParserUtil.Escape(searchParameters.QueryString);
var query = _queryParser.Parse(_queryString, "baseKeywords"); // parse the escaped string, not the raw input
_searcher.Search(query, resultsCollector.GetCollector());

That custom collector is defined as:

private List<(int doc, float score)> _results;

public ICollector GetCollector()
{
    return Collector.NewAnonymous(setScorer: (scorer) =>
    {
        _scorer = scorer;
    }, collect: (doc) =>
    {
        _results.Add((doc, _scorer.GetScore()));
    }, setNextReader: (context) =>
    {
        //
    }, acceptsDocsOutOfOrder: () =>
    {
        return true;
    });
}

These are excerpts from the full source, which is available here

This behavior is consistent: even when I query a different field with a query that should match only the updated document, I get docId=0. If I open the index in Luke, the behavior is not present and I can query for the updated document.

The document I'm updating has the highest docId, so I'm wondering whether this is an off-by-one problem.

I tried updating a different document; now, whenever I query for either one, I get docId=0.

If I query for the document with docId=0, I get the document with docId=0, not the updated document.

I originally was using the reader from _writer.GetReader(applyAllDeletes: true); but switching to DirectoryReader.Open did not change anything.

Maybe this is some sort of default return value I'm not aware of? Not sure why it is returning docId=0 all the time.


Solution

  • The custom collector was set up incorrectly: it ignored the docBase of the segment being collected.

    From the Lucene.NET documentation:

    NOTE: The doc that is passed to the collect method is relative to the current reader. If your collector needs to resolve this to the docID space of the Multi*Reader, you must re-base it by recording the docBase from the most recent SetNextReader(AtomicReaderContext) call.

    Here is the fixed collector:

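    private List<(int doc, float score)> _results;
    // Offset of the current segment within the top-level reader's docID space.
    private int docBase;

    public ICollector GetCollector()
    {
        return Collector.NewAnonymous(setScorer: (scorer) =>
        {
            _scorer = scorer;
        }, collect: (doc) =>
        {
            // doc is relative to the current segment; re-base it before storing.
            _results.Add((doc + docBase, _scorer.GetScore()));
        }, setNextReader: (context) =>
        {
            // Called once per segment, before any collect call for that segment.
            docBase = context.DocBase;
        }, acceptsDocsOutOfOrder: () =>
        {
            return true;
        });
    }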
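
To see why the symptom was always docId=0, it may help to model the arithmetic without Lucene.NET at all. The sketch below is only an illustration under stated assumptions: the class and method names (DocBaseSketch, ComputeDocBases, ToTopLevelDocId) are hypothetical, and the segment layout (five original documents in one segment, the re-added document alone in a second segment after the commit) is assumed, not taken from the asker's index. An update is a delete plus an add, so the new copy typically lands first in a fresh segment, where its segment-relative ID is 0.

```csharp
using System;

public static class DocBaseSketch
{
    // docBase of each segment = sum of MaxDoc over all earlier segments.
    // Deleted documents still count toward MaxDoc until a merge reclaims them.
    public static int[] ComputeDocBases(int[] maxDocPerSegment)
    {
        var docBases = new int[maxDocPerSegment.Length];
        var next = 0;
        for (var i = 0; i < maxDocPerSegment.Length; i++)
        {
            docBases[i] = next;
            next += maxDocPerSegment[i];
        }
        return docBases;
    }

    // The same re-basing that the fixed collector performs inside collect.
    public static int ToTopLevelDocId(int docBase, int segmentLocalDoc)
    {
        return docBase + segmentLocalDoc;
    }

    public static void Main()
    {
        // Assumed layout: segment 0 holds five original documents (one now
        // marked deleted by the update), segment 1 holds the re-added document.
        int[] docBases = ComputeDocBases(new[] { 5, 1 });

        int segmentLocalDoc = 0; // what collect receives for the re-added document

        // Without re-basing: 0, which collides with the first original document.
        Console.WriteLine(segmentLocalDoc);
        // With re-basing: 5, the ID in the top-level reader's docID space.
        Console.WriteLine(ToTopLevelDocId(docBases[1], segmentLocalDoc));
    }
}
```

Under these assumptions the unfixed collector stores 0 for the updated document, which the top-level searcher resolves to the first document of the first segment. That matches the behavior described in the question, including the fact that Luke, which does not go through the custom collector, showed the right document.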