
Strange Behavior with ConcurrentUpdateSolrServer Class


I'm using SolrJ to index some files, but I've noticed strange behavior with the ConcurrentUpdateSolrServer class. My goal is to index files very fast (around 15,000 documents per second).

I've set up one Solr instance on a remote Linux virtual machine (VM) with 8 CPUs, and I've implemented a Java program with SolrJ on my computer using Eclipse. I will describe the two scenarios I've tried in order to explain my problem:

Scenario 1:

I've run my Java program from Eclipse to index my documents, pointing the client at the address of my VM like this:

String url = "http://10.35.1.72:8080/solr/";
ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(url, 4000, 20);

And I've added my documents from a Java class that extends Thread:

@Override
public void run() {
    SolrInputDocument doc = new SolrInputDocument();
    /*
     * Processing on document to add fields ...
     */
    try {
        UpdateResponse response = server.add(doc);
        /*
         * Response's analysis
         */
    } catch (SolrServerException | IOException e) {
        // add() throws checked exceptions, which run() cannot propagate
        e.printStackTrace();
    }
}

But to avoid adding documents sequentially, I've used an Executor to add them in parallel, like this:

Executor executor = Executors.newFixedThreadPool(nbThreads);
for (int j = 0; j < myfileList.size(); j++) {
    executor.execute(new myclassThread(server, myfileList.get(j)));
}
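One detail I'm unsure about is whether I wait for the pool correctly before committing. Below is a minimal, self-contained sketch of the shutdown/await pattern I'm using, with a dummy counter standing in for the real myclassThread indexing work (the counter and task count are just for illustration):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

public class ExecutorSketch {
    public static void main(String[] args) throws InterruptedException {
        int nbThreads = 4;
        int nbDocs = 100;
        AtomicInteger indexed = new AtomicInteger();

        ExecutorService executor = Executors.newFixedThreadPool(nbThreads);
        for (int j = 0; j < nbDocs; j++) {
            // Stand-in for executor.execute(new myclassThread(server, myfileList.get(j)))
            executor.execute(indexed::incrementAndGet);
        }

        // Stop accepting new tasks, then wait until every queued task has run;
        // only after this point is it safe to call commit() on the server.
        executor.shutdown();
        executor.awaitTermination(1, TimeUnit.MINUTES);

        System.out.println(indexed.get());
    }
}
```

Without the shutdown()/awaitTermination() pair, the main thread can reach a commit (or exit) while tasks are still queued, which would look exactly like "lost" documents.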

When I run this program, the result is fine: all my documents are properly indexed in the Solr index. I can see it in the Solr admin:

Results:
numDocs: 3588
maxDoc: 3588
deletedDocs: 0

The problem is that indexing performance is very low (slow indexing speed) compared to indexing directly on the VM without SolrJ. That's why I've built my program into a jar file to run it on the VM.

Scenario 2:

So, I've generated a jar file with Eclipse and run it on my VM, changing the server's URL like this:

String url = "http://localhost:8080/solr/";
ConcurrentUpdateSolrServer server = new ConcurrentUpdateSolrServer(url, 4000, 20);

I've run my jar file like this, with the same document collection (3588 documents, each with a unique id):

java -jar myJavaProgram.jar

And the result in the Solr admin is:

Results:
numDocs: 2554
maxDoc: 3475
deletedDocs: 921

This result depends on my thread settings (for the Executor and the SolrServer). In the end, not all the documents are indexed, but the indexing speed is better. I guess my documents are added too fast for Solr and some are lost.

I haven't succeeded in finding the right thread settings. Whether I use many threads or few, I always have losses.

Questions:

  • Has anyone heard of a problem with the ConcurrentUpdateSolrServer class?
  • Is there an explanation for these losses? Why aren't all my documents indexed in the second scenario? And why are some documents deleted even though they have a unique key?
  • Is there a proper way to add documents with SolrJ in parallel (not sequentially)?
  • I've seen another SolrJ class for indexing data: EmbeddedSolrServer. Does this class improve indexing speed, or is it safer than ConcurrentUpdateSolrServer for indexing data?
  • When I analyze the response of the add() method, I notice that the status is always OK (response.getStatus() == 0), but that can't be right because my documents are not all indexed. So is this normal behavior of the add() method or not?
  • Finally, if someone can advise me on how to index data very fast, I would appreciate it a lot! :-)

Edit:

I've tried to slow down my indexing by inserting Thread.sleep(time) between calls to the add() method of the ConcurrentUpdateSolrServer.

I've tried calling commit() after each call to the add() method of the ConcurrentUpdateSolrServer (I know that committing on every add is not a good practice, but it was just a test).
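For reference, what I mean by committing after each add, and the commitWithin variant I've read about, look roughly like this (server and doc are the objects from the snippets above; the 10-second window is an arbitrary value, not a recommendation):

```java
// Hard commit after every add: makes each document durable immediately,
// but is very slow and not a good practice for bulk indexing.
server.add(doc);
server.commit();

// Alternative: ask Solr to commit within a time window instead of
// committing per add (commitWithin is given in milliseconds).
server.add(doc, 10000);
```

The commitWithin overload of add() lets Solr batch commits itself, which should avoid the per-add commit cost while still bounding how long a document stays uncommitted.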

I've tried not using an Executor to manage my threads, and instead created one or several static threads.

After testing these several strategies on my document collection, I've decided to try the EmbeddedSolrServer class to see if the results are better.

So I've implemented this code to use the EmbeddedSolrServer:

final File solrConfigXml = new File("/home/usersolr/solr-4.2.1/indexation_test1/solr/solr.xml");
final String solrHome = "/home/usersolr/solr-4.2.1/indexation_test1/solr";
CoreContainer coreContainer;
try {
    coreContainer = new CoreContainer(solrHome, solrConfigXml);
} catch (Exception e) {
    e.printStackTrace(System.err);
    throw new RuntimeException(e);
}
EmbeddedSolrServer server = new EmbeddedSolrServer(coreContainer, "collection1");

I added the right JARs to make it work, and I succeeded in indexing my collection.

But after these tries, I'm still in trouble with Solr's behavior... I still have the same losses.

Result:
Number of documents indexed: 2554

2554 docs / 3588 docs (myCollection) ...

I guess my problem is more technical, but my computing knowledge stops there! :( Why do I get losses when I index my documents on my VM, while I don't have these losses when I run my Java program from my computer?

Is there a link with Jetty (maybe it cannot absorb the input stream?)? Are there some components on the Solr side (buffers, RAM overflow?) that have limits?

If I'm not being clear enough about my problem, please tell me and I'll try to clarify.

Thanks

Corentin


Solution

  • It was just a mistake in my code: my files were not read in the same order on my computer and on my VM. So the problem wasn't caused by Solr at all; it was my own bug.
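To make the fix concrete: File.listFiles() makes no guarantee about ordering, so the same directory can be listed in different orders on different machines or filesystems. Sorting the listing explicitly gives a deterministic order everywhere. A minimal, self-contained illustration (the temp directory and file names are only for the demo):

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.Comparator;

public class DeterministicListing {
    public static void main(String[] args) throws IOException {
        // Create a throwaway directory with a few files for the demo
        File dir = new File(System.getProperty("java.io.tmpdir"), "listing-demo");
        dir.mkdirs();
        for (String name : new String[] {"b.xml", "a.xml", "c.xml"}) {
            new File(dir, name).createNewFile();
        }

        // File.listFiles() does NOT guarantee any particular order,
        // so sort explicitly to get the same order on every machine.
        File[] files = dir.listFiles();
        Arrays.sort(files, Comparator.comparing(File::getName));

        for (File f : files) {
            System.out.println(f.getName());
        }
    }
}
```

With a sorted listing, the same documents are processed in the same order on the local machine and on the VM, so results like numDocs no longer depend on where the program runs.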