I want to set up a SolrCloud cluster for over 10 million news articles. After reading this article: Shards and Indexing Data in SolrCloud, I have a plan as follows:
In practice, I have some questions:
EDIT @ 2015/9/2:
Answer-1: If you know the schema (structure) of your documents, you can define it in the schema.xml configuration, or you can use Solr's schema-less mode for indexing. Schema-less mode automatically identifies the fields in your documents and indexes them. Its configuration is a little different from that of the schema-based mode. Afterwards, you need to send the documents to Solr for indexing, using curl or the SolrJ Java API. Essentially, Solr provides REST endpoints for all the different operations, so you can write the client in whichever language suits you best.
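As a rough illustration of the REST route, here is a minimal Python sketch that posts JSON documents to a collection's /update handler. The base URL http://localhost:8983/solr, the collection name "news", and the field names are assumptions made for the example, not details from the question.

```python
import json
import urllib.request


def build_update_request(solr_url, collection, docs, commit=True):
    """Build (but do not send) a POST request that indexes docs as a JSON array."""
    url = f"{solr_url.rstrip('/')}/{collection}/update"
    if commit:
        # Ask Solr to commit right away so the documents become searchable.
        url += "?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def index_documents(solr_url, collection, docs):
    """Send the documents to a running Solr node and return its JSON response."""
    req = build_update_request(solr_url, collection, docs)
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Assumed usage against a local node (requires Solr to be running):
# index_documents("http://localhost:8983/solr", "news",
#                 [{"id": "1", "title": "Some headline"}])
```

Committing on every request is convenient for a test but usually too slow for bulk loads; for millions of articles you would more likely send batches and commit occasionally.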
Answer-2: What you describe in your plan, the use of compositeId, is called custom sharding, because you are deciding which shard a particular document should go to.
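With the compositeId router, the routing is driven by a prefix placed before a "!" in the document ID. A small sketch of building such IDs follows; the helper names and the "source" field used as the shard key are hypothetical, chosen only for illustration.

```python
def composite_id(shard_key, doc_id):
    """Return an ID of the form 'shard_key!doc_id' as used by the compositeId router."""
    if "!" in shard_key:
        # '!' is the separator, so it must not appear inside the prefix itself.
        raise ValueError("shard key must not contain '!'")
    return f"{shard_key}!{doc_id}"


def with_composite_id(article, key_field="source"):
    """Return a copy of the article whose 'id' is prefixed with the chosen shard key.

    key_field is a hypothetical field name; use whatever field you route on.
    """
    doc = dict(article)
    doc["id"] = composite_id(str(article[key_field]), str(article["id"]))
    return doc


# Example: articles from the same source share a prefix and land on the same shard.
print(with_composite_id({"id": 12345, "source": "reuters", "title": "Some headline"}))
```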
Answer-3: I would suggest going with the auto-sharding feature if you are not certain how much data you need to index now and in the future. As the index grows, you can split the shards and scale Solr horizontally.
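For reference, a collection with a fixed number of hash-range shards is created through the Collections API. The sketch below only builds the request URL; the host, the collection name "news", and the shard and replica counts are assumed values for the example.

```python
from urllib.parse import urlencode


def build_create_collection_url(solr_url, name, num_shards, replication_factor=1):
    """Build a Collections API URL that creates a collection with num_shards shards."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
    }
    return f"{solr_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Assumed values: a local node, a collection called "news", 4 shards, 2 replicas each.
print(build_create_collection_url("http://localhost:8983/solr", "news", 4, 2))
```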
Answer-4: I went through the Solr documentation and did not find _route_ mentioned anywhere as a mandatory parameter. In some situations, though, it may improve query performance, because the query goes only to the relevant shard and avoids the network latency of querying all the shards.
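To show what passing _route_ looks like, here is a sketch that builds a select URL with and without it. The host, collection name, query, and the "reuters" shard key are assumptions for the example; the value of _route_ is the same prefix used in the document IDs, followed by "!".

```python
from urllib.parse import urlencode


def build_routed_query(solr_url, collection, query, shard_key=None):
    """Build a /select URL, optionally restricted to the shard holding shard_key."""
    params = {"q": query, "wt": "json"}
    if shard_key is not None:
        # Same prefix as in the compositeId document IDs, including the '!'.
        params["_route_"] = f"{shard_key}!"
    return f"{solr_url.rstrip('/')}/{collection}/select?{urlencode(params)}"


# Without shard_key the query fans out to every shard; with it, only to one.
print(build_routed_query("http://localhost:8983/solr", "news", "title:election"))
print(build_routed_query("http://localhost:8983/solr", "news", "title:election", "reuters"))
```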
Answer-5: Auto-sharding means routing each document to a shard based on the hash range assigned to that shard when it was created. It does not create new shards automatically just because you specify a new prefix for compositeId. So once the index grows large enough, you might need to split it. Check here for more.
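Splitting is done explicitly through the Collections API's SPLITSHARD action. A minimal sketch of building that request is below; the host, the collection name "news", and the shard name "shard1" are assumed values.

```python
from urllib.parse import urlencode


def build_split_shard_url(solr_url, collection, shard):
    """Build a Collections API URL that splits one shard into two sub-shards."""
    params = {"action": "SPLITSHARD", "collection": collection, "shard": shard}
    return f"{solr_url.rstrip('/')}/admin/collections?{urlencode(params)}"


# Assumed values: split shard1 of the "news" collection on a local node.
print(build_split_shard_url("http://localhost:8983/solr", "news", "shard1"))
```

The split divides the parent shard's hash range between the two new sub-shards, so documents sharing a compositeId prefix still end up together.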