Description (for reference):
I want to index an entire drive of files: ~2 TB.
I'm getting the list of files (using the Commons IO library).
Once I have the list of files, I go through each file and extract readable data from it using Apache Tika.
Once I have the data, I index it using Solr.
I'm using SolrJ with the Java application.
My question is: how do I decide what size of collection to pass to Solr? I've tried passing in different sizes with different results, i.e. sometimes 150 documents per collection performs better than 100, but sometimes it does not. Is there an optimal size, or a configuration that you can tweak, since this process has to be carried out repeatedly?
Complications:
1) Files are stored on a network drive; retrieving the filenames/files takes some time too.
2) Neither this program (the Java app) nor Solr itself can use more than 512 MB of RAM.
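A minimal sketch of the kind of loop described above (the drive path, Solr URL, field names, and batch size are placeholders for illustration, not the actual code):

    import java.io.File;
    import java.util.ArrayList;
    import java.util.Collection;
    import java.util.List;

    import org.apache.commons.io.FileUtils;
    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.common.SolrInputDocument;
    import org.apache.tika.Tika;

    public class DriveIndexer {

        // The collection size in question -- illustrative value only.
        private static final int BATCH_SIZE = 150;

        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
            Tika tika = new Tika();

            // Commons IO: recursively list every file under the drive root.
            Collection<File> files = FileUtils.listFiles(new File("Z:\\"), null, true);

            List<SolrInputDocument> batch = new ArrayList<SolrInputDocument>(BATCH_SIZE);
            for (File file : files) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", file.getAbsolutePath());
                doc.addField("content", tika.parseToString(file)); // Tika text extraction
                batch.add(doc);

                if (batch.size() >= BATCH_SIZE) {
                    solr.add(batch); // send one collection of documents to Solr
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                solr.add(batch);
            }
            solr.commit();
        }
    }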
I'll name just a few of the many parameters that may affect indexing speed. You usually need to experiment with your own hardware, RAM, data-processing complexity, etc. to find the best combination, i.e. there is no single silver bullet for all.
Increase the number of segments allowed during indexing to some large number, say 10k. This makes sure that merging of segments does not happen as often as it would with the default segment count of 10; merging segments during indexing contributes to slowing the indexing down. After the indexing is complete you will have to merge the segments for your search engine to perform, lowering the segment count back to something sensible, like 10.
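The allowed segment count before merging kicks in is controlled by the merge settings (e.g. mergeFactor) in solrconfig.xml; the merge-down afterwards can be triggered from SolrJ. A small sketch, assuming a SolrServer instance called solrServer and a target of 10 segments (both just examples):

    import java.io.IOException;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;

    public final class PostIndexMerge {

        // Merge the index back down to a sensible segment count once the
        // bulk load has finished; 10 is only an example target.
        public static void mergeDown(SolrServer solrServer)
                throws SolrServerException, IOException {
            solrServer.optimize(true, true, 10); // waitFlush, waitSearcher, maxSegments
        }
    }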
Reduce the logging level on your container during indexing. This can be done from the Solr admin UI, and it makes the indexing process faster.
Either reduce the frequency of auto-commits or switch them off and control the committing yourself.
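If you switch auto-commit off (the autoCommit section in solrconfig.xml), you can commit from the client at whatever interval suits you. A sketch of that idea using a hypothetical wrapper class; the interval of 50 batches is only an example:

    import java.io.IOException;
    import java.util.Collection;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.SolrServerException;
    import org.apache.solr.common.SolrInputDocument;

    public final class ManualCommitter {

        // Illustrative value: commit once every 50 batches instead of relying
        // on auto-commit during the bulk load.
        private static final int COMMIT_EVERY_N_BATCHES = 50;

        private final SolrServer solrServer;
        private int batchesSent = 0;

        public ManualCommitter(SolrServer solrServer) {
            this.solrServer = solrServer;
        }

        // Send one batch and commit only at the chosen interval.
        public void addBatch(Collection<SolrInputDocument> batch)
                throws SolrServerException, IOException {
            solrServer.add(batch);
            if (++batchesSent % COMMIT_EVERY_N_BATCHES == 0) {
                solrServer.commit();
            }
        }

        // Call once at the very end of the run.
        public void finish() throws SolrServerException, IOException {
            solrServer.commit();
        }
    }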
Remove the warmup queries for the bulk indexing and don't auto-warm any cache entries (autowarmCount=0).
Use ConcurrentUpdateSolrServer for indexing and, if you are using SolrCloud, CloudSolrServer.
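A sketch of how those clients are typically constructed (the queue size, thread count, ZooKeeper host, and collection name are placeholders to adapt, especially given the 512 MB heap limit):

    import java.net.MalformedURLException;

    import org.apache.solr.client.solrj.impl.CloudSolrServer;
    import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;

    public final class SolrServerFactory {

        // Queue size and thread count are illustrative; tune them with the
        // 512 MB heap in mind, since queued documents are buffered in memory.
        public static ConcurrentUpdateSolrServer forSingleNode(String solrUrl) {
            return new ConcurrentUpdateSolrServer(solrUrl, 1000, 4);
        }

        // For SolrCloud: point at the ZooKeeper ensemble and pick a default collection.
        public static CloudSolrServer forCloud(String zkHost, String collection)
                throws MalformedURLException {
            CloudSolrServer server = new CloudSolrServer(zkHost);
            server.setDefaultCollection(collection);
            return server;
        }
    }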