
What's the difference: ConcurrentUpdateSolrServer vs HttpSolrServer vs CommonsHttpSolrServer?


What are the differences between the following implementations of SolrServer:

  1. ConcurrentUpdateSolrServer
  2. HttpSolrServer
  3. CommonsHttpSolrServer (Note: Is this now deprecated?)

As mentioned in the documentation:

It is only recommended to use ConcurrentUpdateSolrServer with /update requests. The class HttpSolrServer is better suited for the query interface.

The documentation for ConcurrentUpdateSolrServer suggests using it for updates and HttpSolrServer for queries. Why is this?

At the moment I am using HttpSolrServer for everything. Would using ConcurrentUpdateSolrServer for updates result in significant performance improvements?


Solution

  • As of 2017, the Solr community has renamed SolrServer to SolrClient, and there are currently four implementations:

    1. CloudSolrClient
    2. ConcurrentUpdateSolrClient
    3. HttpSolrClient
    4. LBHttpSolrClient

    The documentation suggests using ConcurrentUpdateSolrClient for updates because it buffers all update requests in an internal queue (final BlockingQueue<Update> queue;), so updates take less time than with HttpSolrClient, which fires each update request as soon as it receives it. We could simply trust the documentation, but this is easy enough to check, so I did some performance testing.
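To make the difference between the two strategies concrete, here is a toy model in plain Java. It does not use SolrJ at all: SendStrategyModel, sendImmediately and sendBuffered are hypothetical names, "requests" are only counted, and the real ConcurrentUpdateSolrClient does considerably more than this. It is just a sketch of "fire every update at once" versus "put updates in a bounded queue and let worker threads drain it".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: no SolrJ classes are used, requests are simply counted.
final class SendStrategyModel {

    // Immediate strategy: every document costs one simulated request.
    static int sendImmediately(List<String> docs) {
        int requests = 0;
        for (String doc : docs) {
            requests++; // one round trip per document
        }
        return requests;
    }

    // Buffered strategy: the caller fills a bounded queue, worker threads
    // drain it, and each drained chunk costs one simulated request.
    // Returns {simulated requests, documents sent}.
    static int[] sendBuffered(List<String> docs, int queueSize, int threads) {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>(queueSize);
        AtomicBoolean done = new AtomicBoolean(false);
        AtomicInteger requests = new AtomicInteger();
        AtomicInteger sent = new AtomicInteger();

        Runnable worker = () -> {
            try {
                // Keep going until the producer is finished and the queue is empty.
                while (!(done.get() && queue.isEmpty())) {
                    String first = queue.poll(10, TimeUnit.MILLISECONDS);
                    if (first == null) {
                        continue; // nothing queued yet, check again
                    }
                    List<String> chunk = new ArrayList<>();
                    chunk.add(first);
                    queue.drainTo(chunk); // grab whatever else is waiting
                    requests.incrementAndGet();
                    sent.addAndGet(chunk.size());
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(worker);
        }
        try {
            for (String doc : docs) {
                queue.put(doc); // blocks only while the queue is full
            }
            done.set(true);
            pool.shutdown();
            if (!pool.awaitTermination(1, TimeUnit.MINUTES)) {
                throw new IllegalStateException("workers did not finish");
            }
        } catch (InterruptedException e) {
            pool.shutdownNow();
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while sending", e);
        }
        return new int[] {requests.get(), sent.get()};
    }
}
```

Calling SendStrategyModel.sendBuffered(titles, 1000, 5) returns the number of simulated requests and the number of documents sent. The request count varies from run to run because it depends on thread timing, but it can never exceed the number of documents, which is the point of buffering.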

    First, though, a note on how the clients' operations differ. If you use the add operation of SolrClient, it makes no difference whether you create an HttpSolrClient or a ConcurrentUpdateSolrClient, because both do the same thing. ConcurrentUpdateSolrClient only shines if you are explicitly issuing an UpdateRequest.

    Test results for indexing Wikipedia titles (code). My machine: Intel i5-4670S 3.1 GHz, 16 GB RAM.

    ConcurrentUpdateSolrClient (5 threads, 1000 queue size) - 200 seconds    
    ConcurrentUpdateSolrClient (5 threads, 10000 queue size) - 150 seconds    
    ConcurrentUpdateSolrClient (10 threads, 1000 queue size) - 100 seconds    
    ConcurrentUpdateSolrClient (10 threads, 10000 queue size) - 30 seconds    
    HttpSolrClient (no bulk) - 7000 seconds    
    HttpSolrClient (bulk 1000 docs) - 150 seconds    
    HttpSolrClient (bulk 10000 docs) - 80 seconds
    

    Summary:

    1. If you use the clients in a similar fashion, e.g. client.add(doc);, then ConcurrentUpdateSolrClient performs at least 10-20 times faster, thanks to its use of a thread pool and a queue (in effect, a bulk operation).

    2. If you're using HttpSolrClient, you can still mimic this behaviour by manually creating several clients, running additional threads and using some intermediate storage, such as a List. It will certainly improve performance, but it requires additional code.

    3. The absolute numbers probably mean very little, but I hope they give a rough comparison.
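The "intermediate storage, like List" idea from point 2 can be sketched roughly as follows. This is a minimal, single-threaded illustration in plain Java, not the code used for the test above: ManualBatcher and its sink parameter are hypothetical names, and the extra clients and threads are left out. With SolrJ, the sink would presumably wrap a call such as client.add(batch) on an HttpSolrClient, with its checked exceptions handled by the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical helper: collects items in a List and hands them to a sink
// in batches of a fixed size. It knows nothing about Solr itself.
final class ManualBatcher<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink; // assumed to send one bulk request
    private final List<T> buffer = new ArrayList<>();

    ManualBatcher(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    // Buffer one item and send a batch as soon as the buffer is full.
    void add(T item) {
        buffer.add(item);
        if (buffer.size() >= batchSize) {
            flush();
        }
    }

    // Send whatever is buffered; call once more after the last add().
    void flush() {
        if (buffer.isEmpty()) {
            return;
        }
        sink.accept(new ArrayList<>(buffer)); // copy, so the sink owns its list
        buffer.clear();
    }
}
```

A caller would create something like new ManualBatcher<>(1000, batch -> send(batch)), where send is whatever bulk call the client offers, feed it documents with add, and finish with flush() so the last partial batch is not lost. This corresponds to the "bulk 1000 docs" style of run in the results.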