I am writing very large (both size and count) documents to a solr index(100s of fields with many numeric and some text fields) . I am using Tomcat 7 on W7 x64.
Based on @Maurico's suggestion when indexing millions of documents I parallelize the write operation (see code sample below)
The write to Solr method is being "Task"ed out from a main loop (Note: I task it out since the write op takes too long and holds up the main app)
The problem is that the memory consumption grows uncontrollably, the culprit is the solr write operations (when I comment them out the run works fine). How do I handle this issue? via Tomcat? or SolrNet?
Thanks for your suggestions.
//main loop:
{
:
:
:
//indexDocsList is the list I create in main loop and "chunk" it out to send to the task.
List<IndexDocument> indexDocsList = new List<IndexDocument>();
for(int n = 0; n< N; n++)
{
indexDocsList.Add(new IndexDocument{X=1, Y=2.....});
if(n%5==0) //every 5th time we write to solr
{
var chunk = new List<IndexDocument>(indexDocsList);
indexDocsList.Clear();
Task.Factory.StartNew(() => WriteToSolr(chunk)).ContinueWith(task => chunk.Clear());
GC.Collect();
}
}
}
private void WriteToSolr(List<IndexDocument> indexDocsList)
{
try
{
if (indexDocsList == null) return;
if (indexDocsList.Count <= 0) return;
int fromInclusive = 0;
int toExclusive = indexDocsList.Count;
int subRangeSize = 25;
//TO DO: This is still leaking some serious memory, need to fix this
ParallelLoopResult results = Parallel.ForEach(Partitioner.Create(fromInclusive, toExclusive, subRangeSize), (range) =>
{
_solr.AddRange(indexDocsList.GetRange(range.Item1, range.Item2 - range.Item1));
_solr.Commit();
});
indexDocsList.Clear();
GC.Collect();
}
catch (Exception ex)
{
logger.ErrorException("WriteToSolr()", ex);
}
finally
{
GC.Collect();
};
return;
}
You are manually committing after each batch. This is the most expensive operation for Solr. In your case, I would recommend autoCommit every x seconds and do a softAutoCommit (Solr 4.0) feature. That should take care of Solr's side of things. You'll also have to tweak your JVM garbage collection options so that you don't get pause the world GC.