Tags: c#, .net, solr, solrnet

Querying Solr while indexing causes loss of documents from the index


I wrote a simple .NET Windows service that pushes documents to Apache Solr v4.1. To access Solr, I use SolrNet. My code is:

var solr = _container.Resolve<ISolrOperations<Document>>();             
solr.Delete(SolrQuery.All);

var docs = from o in documents
           orderby o.Id ascending
           select o;

for (var i = 0; i < docs.Count(); i++ )
{
    var texts = new List<string>();
    if (docs.ToList()[i].DocumentAttachments.Count > 0)
    {
        foreach (var attach in docs.ToList()[i].DocumentAttachments)
        {
            using (var fileStream = System.IO.File.OpenRead(...))
            {
                var extractResult = solr.Extract(
                    new ExtractParameters(fileStream, attach.Id.ToString(CultureInfo.InvariantCulture))
                    {
                        ExtractFormat = ExtractFormat.Text,
                        ExtractOnly = true
                    }
                );
                texts.Add(extractResult.Content);                   
            }
        }
    }

    docs.ToList()[i].GetFilesText = texts;
    solr.Add(docs.ToList()[i]);

    if (i % _commitStep == 0)
    {
        solr.Commit();
        solr.Optimize();
    }
}

solr.Commit();
solr.Optimize();
solr.BuildSpellCheckDictionary();

"Document.GetFilesText" is a field that stores the text extracted from PDF files.
The logging calls (which write to the Windows Event Log) have been removed from this example. While indexing, I watched:
a) the Event Log, which shows the document indexing progress
b) the "Core Admin" page in the "Solr Admin" web app, which shows the number of documents in the index

When I only index documents, without searching, everything works correctly: the Event Log shows a "7500 docs added" entry and "Core Admin" shows num docs = 7500.

But if I try to search for documents during indexing, I get these errors:
- the search results do not contain all of the documents that were passed to Solr
- "Core Admin" resets the num docs value. For example, the Event Log shows 7500 docs indexed, but "Core Admin" shows num docs=23. The num docs value resets every time I query Solr.

My querying code:

searchPhrase = textBox1.Text;
var documents = Solr.Query(new SolrQuery(searchPhrase), new QueryOptions
    {
        Highlight = new HighlightingParameters
            {
                UsePhraseHighlighter = true,
                Fields = new Collection<string> { "Field1", "Field2", "Field3" },
                BeforeTerm = "<b>",
                AfterTerm = "</b>"
            },
        Rows = 100
    });

UPD: To make things clear, I have these lines in my web app's "search" page:

public class MyController : Controller
{
    public ISolrOperations<Document> Solr { get; set; }

    public MyController()
    {
        //_solr = solr;
    }

    //
    // GET: /Search/My/
    public ActionResult Index()
    {
        Solr.Delete(SolrQuery.All);

        return View();
    }
...

And opening this page in a browser causes a total loss of documents from the Solr index. :-)


Solution

  • You are seeing this behavior because the first thing you do is clear the index.

    solr.Delete(SolrQuery.All)
    

    This removes all documents from the index, so the index is empty once reindexing starts. Your subsequent code then adds the items back into the index in batches. However, any new documents you add will not be visible to users querying the index until a commit is issued. Because you add documents and issue commits in batches, the document count increases while you are rebuilding, and not all documents are visible during that time. The count will not reach 7500 until the last commit is issued.
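One way to observe this from code, rather than from the "Core Admin" page, is to ask Solr how many documents a match-all query can currently see. The following is only a minimal sketch: the class and method names are hypothetical, and it assumes a SolrNet version in which the query result exposes NumFound. It returns the committed, searchable count, which lags behind the number of documents added until a commit makes them visible.

```csharp
using SolrNet;
using SolrNet.Commands.Parameters;

// Hypothetical helper, not part of the original code.
public static class IndexVisibility
{
    // Returns the number of documents currently visible to searches.
    // Rows = 0 asks Solr for the count only, without returning documents.
    public static long CountVisibleDocuments<T>(ISolrReadOnlyOperations<T> solr)
    {
        var results = solr.Query(SolrQuery.All, new QueryOptions { Rows = 0 });
        return results.NumFound;
    }
}
```

Calling this helper between batches during a rebuild should show the count growing only after each commit, not after each Add.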

    There are a couple of options that might help alleviate this.

    1. Issue soft commits to Solr using commitWithin or auto soft commits. CommitWithin is supported in SolrNet through the optional AddParameters argument to the Add method. You could call solr.Add(docs.ToList()[i], new AddParameters{ CommitWithin = 3000}); which would tell Solr to commit this document within 3 seconds.
    2. Use Solr cores: keep an "active" core that users search against and reload your data into a "standby" core. Once the load into the standby core has completed, you can issue a command to SWAP the cores, and this will be totally transparent to any users. CoreAdmin commands are supported in SolrNet as well; see the tests in SolrCoreAdminFixture.cs for examples.
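The two options above could look roughly like the sketch below. It is an illustration under assumptions, not a definitive implementation: the class name, method names, batch size and core names are all hypothetical, and it assumes a SolrNet version that provides AddRange with AddParameters and an ISolrCoreAdmin with a Swap method. How the ISolrCoreAdmin instance is created is left out; SolrCoreAdminFixture.cs shows one way.

```csharp
using System;
using System.Collections.Generic;
using SolrNet;

// Hypothetical helpers illustrating options 1 and 2.
public static class ReindexSketch
{
    // Option 1: add documents in batches and let Solr schedule the commits
    // through commitWithin instead of committing after every batch.
    public static void IndexWithCommitWithin<T>(
        ISolrOperations<T> solr, IList<T> docs, int batchSize, int commitWithinMs)
    {
        if (batchSize <= 0)
        {
            throw new ArgumentOutOfRangeException("batchSize");
        }

        var parameters = new AddParameters { CommitWithin = commitWithinMs };

        for (var start = 0; start < docs.Count; start += batchSize)
        {
            // Copy the next slice of the already materialized list.
            var batch = new List<T>();
            for (var i = start; i < docs.Count && i < start + batchSize; i++)
            {
                batch.Add(docs[i]);
            }

            solr.AddRange(batch, parameters);
        }

        // Final explicit commit so everything is visible when the load ends.
        solr.Commit();
    }

    // Option 2: after the standby core has been fully loaded and committed,
    // swap it with the core that users are searching against.
    public static void SwapCores(
        ISolrCoreAdmin coreAdmin, string standbyCore, string activeCore)
    {
        coreAdmin.Swap(standbyCore, activeCore);
    }
}
```

A call such as ReindexSketch.IndexWithCommitWithin(solr, docs.ToList(), 100, 3000) would send the documents in batches of 100 and ask Solr to make each batch searchable within about 3 seconds. With the standby-core approach, the initial Delete would be issued against the standby core only, so the core that users search is never empty.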

    Hope this helps.