Tags: solr, lucene, replication, solrj, solrcloud

Solr fetchIndex command fails inexplicably on sharded nodes


I'm having a strange issue invoking the fetchIndex command via REST. I'm attempting to use fetchIndex to propagate data from one SolrCloud instance to another. My reading of the documentation indicates this should be possible:

fetchindex

Force the specified slave to fetch a copy of the index from its master. http://slave_host:port/solr/core_name/replication?command=fetchindex

If you like, you can pass an extra attribute such as masterUrl or compression (or any other parameter which is specified in the <lst name="slave"> tag) to do a one-time replication from a master. This obviates the need for hard-coding the master in the slave.

The issue I'm having is a number of unexpected exceptions when replication begins. For example, from the 'slave' node:

2020-12-15 00:17:17.442 INFO  (explicit-fetchindex-cmd) [   ] o.a.s.h.IndexFetcher Starting replication process
2020-12-15 00:17:17.445 INFO  (explicit-fetchindex-cmd) [   ] o.a.s.h.IndexFetcher Number of files in latest index in master: 17
2020-12-15 00:17:17.449 INFO  (explicit-fetchindex-cmd) [   ] o.a.s.u.DefaultSolrCoreState New IndexWriter is ready to be used.
2020-12-15 00:17:17.449 INFO  (explicit-fetchindex-cmd) [   ] o.a.s.h.IndexFetcher Starting download (fullCopy=false) to NRTCachingDirectory(MMapDirectory@C:\scratch\solr-7.7.3\example\cloud\node1\solr\techproducts_shard1_replica_n1\data\index.20201215001717446 lockFactory=org.apache.lucene.store.NativeFSLockFactory@5577fa1; maxCacheMB=48.0 maxMergeSizeMB=4.0)
2020-12-15 00:17:17.455 ERROR (explicit-fetchindex-cmd) [   ] o.a.s.h.IndexFetcher Error fetching file, doing one retry...:org.apache.solr.common.SolrException: Unable to download _0.si completely. Downloaded 551!=533
        at org.apache.solr.handler.IndexFetcher$FileFetcher.cleanup(IndexFetcher.java:1700)
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetch(IndexFetcher.java:1580)
        at org.apache.solr.handler.IndexFetcher$FileFetcher.fetchFile(IndexFetcher.java:1550)
        at org.apache.solr.handler.IndexFetcher.downloadIndexFiles(IndexFetcher.java:1030)
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:569)
        at org.apache.solr.handler.IndexFetcher.fetchLatestIndex(IndexFetcher.java:346)
        at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:425)
        at org.apache.solr.handler.ReplicationHandler.lambda$fetchIndex$0(ReplicationHandler.java:346)
        at java.lang.Thread.run(Thread.java:748)

These exceptions cause the replication to abort. There have been a few questions on SO that reference an error like this (solr ReplicationHandler - SnapPull failed to download files), but none that seem relevant to this situation.

The problem is extremely simple to reproduce, using only basic Solr installs and no special data. I am using Solr 7.7.3.

Steps to Reproduce:

  1. Unpack Solr on the 'master' machine.
  2. Execute ./bin/solr -e cloud to deploy the example SolrCloud cluster. Accept all defaults, except:
    • name the collection 'techproducts' instead of 'gettingstarted'
    • select the 'sample_techproducts_configs' configuration set.
  3. Load the sample techproducts data into Solr: bin/post -c techproducts ./example/exampledocs/*
  4. Repeat steps 1 & 2 on another machine or VM. Do NOT load the techproducts data - we want to replicate it using fetchIndex instead.
  5. Load up Postman or the REST client of your choice and invoke the fetchIndex command on the second machine: GET http://<second machine>:8983/solr/techproducts/replication?command=fetchindex&masterUrl=http://<first machine>:8983/solr/techproducts

This should produce the error output shown above, in the logs of the 'slave' machine. I'm bound by my task to use Solr 7.7.3, but I have tried different JVMs and both Windows and Linux hosts. All combinations yield the same results.

I feel as though I must be missing something, but I'm not sure what. Any advice or suggestions would be extremely helpful.

I am also curious how to properly invoke this behavior programmatically through SolrJ, but that may be best left to another question once this issue has been resolved.
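For what it's worth, the SolrJ version I have in mind would simply drive the same /replication handler through a GenericSolrRequest. Below is a rough, untested sketch; the host names are placeholders, and (per the resolution below) the client targets a core rather than the collection:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrRequest;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.GenericSolrRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;
    import org.apache.solr.common.util.NamedList;

    public class FetchIndexViaSolrJ {
        public static void main(String[] args) throws Exception {
            // Client pointed at the *core* on the 'slave' node (placeholder host and core name).
            try (SolrClient slave = new HttpSolrClient.Builder(
                    "http://slave-host:8983/solr/techproducts_shard1_replica_n1").build()) {

                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("command", "fetchindex");
                // One-time pull from the corresponding core on the 'master' node (placeholder host).
                params.set("masterUrl",
                        "http://master-host:8983/solr/techproducts_shard1_replica_n1");

                // GenericSolrRequest lets us address the /replication handler directly.
                GenericSolrRequest fetch = new GenericSolrRequest(
                        SolrRequest.METHOD.GET, "/replication", params);
                NamedList<Object> response = slave.request(fetch);
                System.out.println(response);
            }
        }
    }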

Edit: I've been able to successfully replicate using this procedure by reducing the number of shards/replicas in the example clouds to one. I'm now investigating what I need to do to perform these index replications on a per-shard basis, but I don't yet have the answer.


Solution

  • It turns out that I had conflated collections and cores early in this process and failed to notice. In the REST URL provided,

    GET http://<second machine>:8983/solr/techproducts/replication?command=fetchindex&masterUrl=http://<first machine>:8983/solr/techproducts

    I had supplied collection names rather than core names. A correct example:

    GET http://<second machine>:8983/solr/techproducts_shard1_replica_n1/replication?command=fetchindex&masterUrl=http://<first machine>:8983/solr/techproducts_shard1_replica_n1

    Of course, this REST request needs to be repeated for each core in order to replicate an entire cloud instance correctly. Strangely, Solr does not produce an explicit error message when the replication endpoint is invoked with a collection name rather than a core name; it attempts the replication anyway. When more than one shard is involved, the destination node is then chasing a "moving target": requests addressed to a collection are routed to an arbitrary core, so the file list and sizes reported by the 'master' change between calls, which produces the size-mismatch errors shown above. A rough SolrJ sketch of repeating the fetchindex call for every core follows below.
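    Here is an untested SolrJ sketch of that per-core repetition. Host names are placeholders, and it assumes each core reported by the destination node also exists under the same name on the source node, which is the case for two identically created example clouds. It enumerates the cores via the CoreAdmin STATUS action, then issues fetchindex to each core's /replication handler:

        import org.apache.solr.client.solrj.SolrClient;
        import org.apache.solr.client.solrj.SolrRequest;
        import org.apache.solr.client.solrj.impl.HttpSolrClient;
        import org.apache.solr.client.solrj.request.CoreAdminRequest;
        import org.apache.solr.client.solrj.request.GenericSolrRequest;
        import org.apache.solr.client.solrj.response.CoreAdminResponse;
        import org.apache.solr.common.params.CoreAdminParams;
        import org.apache.solr.common.params.ModifiableSolrParams;

        public class FetchIndexAllCores {
            public static void main(String[] args) throws Exception {
                String slaveBase = "http://slave-host:8983/solr";   // placeholder destination node
                String masterBase = "http://master-host:8983/solr"; // placeholder source node

                try (SolrClient slave = new HttpSolrClient.Builder(slaveBase).build()) {
                    // Ask the destination node which cores it hosts.
                    CoreAdminRequest status = new CoreAdminRequest();
                    status.setAction(CoreAdminParams.CoreAdminAction.STATUS);
                    CoreAdminResponse cores = status.process(slave);

                    for (int i = 0; i < cores.getCoreStatus().size(); i++) {
                        String coreName = cores.getCoreStatus().getName(i);

                        ModifiableSolrParams params = new ModifiableSolrParams();
                        params.set("command", "fetchindex");
                        // Assumes a core with the same name exists on the source node.
                        params.set("masterUrl", masterBase + "/" + coreName);

                        // Per-core replication handler, e.g. /techproducts_shard1_replica_n1/replication
                        GenericSolrRequest fetch = new GenericSolrRequest(
                                SolrRequest.METHOD.GET, "/" + coreName + "/replication", params);
                        System.out.println(coreName + " -> " + slave.request(fetch));
                    }
                }
            }
        }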