solr, solrcloud

Solr data being indexed on all servers [sharding mode]


I created three SolrCloud instances so that I could shard data across them and query all three. I started them with the commands below.

CMD:

solr.cmd start -c -s Node1 -p 8983
solr.cmd start -c -s Node2 -z localhost:9983 -p 8984
solr.cmd start -c -s Node3 -z localhost:9983 -p 8985

Then I created a collection which uses three shards and has a replication factor of 1.

CMD1:

solr.cmd create_collection -c tests -shards 3 -replicationFactor 1

Then I indexed data into the collection with post.jar, using the following command.

CMD2:

java -jar post.jar *.xml

There were 32 XML files in that location.

As I understand it, the data should be split and indexed across the three SolrCloud instances.

But what actually happened was that all 32 documents appeared to be indexed on each of the three instances.

I confirmed this by using the following URLs:

http://localhost:8984/solr/tests/select?indent=on&q=*:*&wt=json
http://localhost:8985/solr/tests/select?indent=on&q=*:*&wt=json
http://localhost:8983/solr/tests/select?indent=on&q=*:*&wt=json

All three returned the same number of records.

I expected the documents to be split, with each instance holding only its own share.

I want to index 3 billion documents into Solr, and since there is a hard limit of about 2 billion documents per index, I want to make sure that they are split and indexed across the three Solr instances.

Let me know if I have made any mistakes.

Versions:

Solr = 6.1.0
Windows = 7

Solution

  • When you're querying /solr/tests, you're querying the tests collection. Behind the scenes, Solr fetches the matching documents from every shard in that collection and returns the combined result to you.

    You've stumbled upon the idea behind a collection in Solr: regardless of which server you query, Solr returns the result for the whole collection, including all documents added to it. The only difference between the three requests you're making is which server is responsible for returning the result to the client and for making the requests that fetch results from the other cores.

    If you want to explore the contents of a single core, these cores are named collectionname_shardX_replicaY. You can examine the current cluster state if you download the JSON file from the ZooKeeper instance; this will show you exactly which shards are located where.

    You can also use the CoreAdmin API on a single node to examine which cores have been placed on that server. Be aware that you do NOT want to perform any mutating actions through the CoreAdmin API when you're running in cloud mode.
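
To check how the documents are actually distributed, something along the lines of the sketch below may help. It is only an illustration under stated assumptions: the core name tests_shard1_replica1 is guessed from the collectionname_shardX_replicaY pattern above (take the real names from the CoreAdmin STATUS output on each node), the sample response is hand-written and trimmed, and the count of 11 is made up. It builds a select URL that adds distrib=false, which asks Solr to answer from that one core only instead of fanning out to the whole collection, and it reads the per-core numDocs values out of a parsed CoreAdmin STATUS response (/solr/admin/cores?action=STATUS&wt=json).

```python
import json
from urllib.parse import urlencode


def core_select_url(base_url, core, rows=0):
    """Build a select URL that queries a single core only.

    distrib=false stops Solr from forwarding the query to the other shards,
    so numFound reflects just the documents stored in this core.
    """
    params = urlencode({"q": "*:*", "rows": rows, "distrib": "false", "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


def docs_per_core(status_response):
    """Map core name -> numDocs from a parsed CoreAdmin STATUS response."""
    return {
        name: info.get("index", {}).get("numDocs", 0)
        for name, info in status_response.get("status", {}).items()
    }


# Hypothetical, heavily trimmed STATUS response for one node (values invented).
sample = json.loads(
    '{"status": {"tests_shard1_replica1": {"index": {"numDocs": 11}}}}'
)

print(docs_per_core(sample))
print(core_select_url("http://localhost:8983/solr", "tests_shard1_replica1"))
```

Running the read-only STATUS call against each of the three nodes and adding up the numDocs values should give the 32 documents in total, with each core holding only its own share. Both calls used here are read-only, so they stay clear of the warning about changing things through the CoreAdmin API in cloud mode.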