I wrote a simple .NET Windows service that pushes documents to Apache Solr v4.1. To access Solr, I use SolrNet. My code is:
    var solr = _container.Resolve<ISolrOperations<Document>>();
    solr.Delete(SolrQuery.All);
    var docs = from o in documents
               orderby o.Id ascending
               select o;
    for (var i = 0; i < docs.Count(); i++)
    {
        var texts = new List<string>();
        if (docs.ToList()[i].DocumentAttachments.Count > 0)
        {
            foreach (var attach in docs.ToList()[i].DocumentAttachments)
            {
                using (var fileStream = System.IO.File.OpenRead(...))
                {
                    var extractResult = solr.Extract(
                        new ExtractParameters(fileStream, attach.Id.ToString(CultureInfo.InvariantCulture))
                        {
                            ExtractFormat = ExtractFormat.Text,
                            ExtractOnly = true
                        }
                    );
                    texts.Add(extractResult.Content);
                }
            }
        }
        docs.ToList()[i].GetFilesText = texts;
        solr.Add(docs.ToList()[i]);
        if (i % _commitStep == 0)
        {
            solr.Commit();
            solr.Optimize();
        }
    }
    solr.Commit();
    solr.Optimize();
    solr.BuildSpellCheckDictionary();
"Document.GetFilesText" is a field that stores the text extracted from the PDF files.
The logging calls (which write to the Windows Event Log) have been removed from this example. While indexing, I watch:
a) the Event Log, which shows the indexing progress
b) the "Core Admin" page in the "Solr Admin" web app, which shows the number of documents in the index
When I only index documents, without searching, everything works: the Event Log shows a "7500 docs added" entry and "Core Admin" shows num docs = 7500.
But if I search for documents while indexing is running, I get these problems:
- the search results do not contain all of the submitted documents
- "Core Admin" resets the num docs value. For example, the Event Log shows 7500 docs indexed, but "Core Admin" shows num docs = 23. The num docs value resets every time I query Solr.
My querying code:
    searchPhrase = textBox1.Text;
    var documents = Solr.Query(new SolrQuery(searchPhrase), new QueryOptions
    {
        Highlight = new HighlightingParameters
        {
            UsePhraseHighlighter = true,
            Fields = new Collection<string> { "Field1", "Field2", "Field3" },
            BeforeTerm = "<b>",
            AfterTerm = "</b>"
        },
        Rows = 100
    });
UPD: to make things clear, I have these lines in the "search" page of my web app:
    public class MyController : Controller
    {
        public ISolrOperations<Document> Solr { get; set; }

        public MyController()
        {
            //_solr = solr;
        }

        //
        // GET: /Search/My/
        public ActionResult Index()
        {
            Solr.Delete(SolrQuery.All);
            return View();
        }
        ...
And opening this page in a browser causes a total loss of the documents in the Solr index. :-)
You are seeing this behavior because the first thing you do is clear the index.
solr.Delete(SolrQuery.All)
This removes all documents from the index, so once reindexing starts the index is empty. Your subsequent code then adds the items back into the index in batches. However, any new documents you add will not be visible to users querying the index until a commit is issued. Because you add documents and issue commits in batches while rebuilding, the document count grows step by step and not all documents are visible during the rebuild. The count will not reach 7500 until the last commit is issued.
There might be a couple of options to help alleviate this for you. One is to pass an AddParameters argument with CommitWithin to the Add method in SolrNet. You could issue
solr.Add(docs.ToList()[i], new AddParameters { CommitWithin = 3000 });
which would tell Solr to commit the added document within 3 seconds (CommitWithin is given in milliseconds).
Hope this helps.
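To show how CommitWithin could replace the per-batch Commit/Optimize calls, here is a minimal sketch rather than a drop-in replacement. It assumes a cut-down Document class (the real one in the question has more members), and the Solr field names "id" and "files_text", the Indexer class, and the IndexAll method are hypothetical names made up for illustration. It also leaves out the initial Delete(SolrQuery.All), on the assumption that documents keep their ids between runs, so that re-adding a document overwrites the old copy instead of emptying the index first.

```csharp
using System.Collections.Generic;
using SolrNet;
using SolrNet.Attributes;

// Hypothetical, cut-down document type; field names are assumptions.
public class Document
{
    [SolrUniqueKey("id")]
    public int Id { get; set; }

    [SolrField("files_text")]
    public ICollection<string> GetFilesText { get; set; }
}

public static class Indexer
{
    // Adds every document and lets Solr decide when to commit,
    // as long as each add becomes visible within 3 seconds.
    public static void IndexAll(ISolrOperations<Document> solr,
                                IEnumerable<Document> documents)
    {
        // CommitWithin is expressed in milliseconds.
        var addParameters = new AddParameters { CommitWithin = 3000 };

        foreach (var doc in documents)
        {
            solr.Add(doc, addParameters);
        }

        // One explicit commit at the end makes sure the last adds are visible.
        solr.Commit();
    }
}
```

With this shape, searches that run during indexing keep seeing the previously committed documents. If documents can disappear from the source between runs, the stale ones would still have to be deleted separately, which this sketch does not cover.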