Search code examples
architecturesolrlucenehigh-availabilitydataimporthandler

After a SWAP core (reindex -> live) my slave machine is corrupt


I'm having a very weird issue here, I have set-up my solr architecture as I supposed here and everything works fine except for that my Solr slave machine gets corrupt. The (test) set-up is as follows:

Solr Setup

  • One Solr master with two cores (reindex and live)
  • One Solr slave with two cores (reindex and live)
  • Pollinterval set to 00:00:02
  • Replicating after a optimize or commit
  • Running Solr under Ubuntu (11.10) version 1.4.0.

The process

The whole proces goes like this:

  1. We wipe the index on the master reindex core
  2. We disable replication on the master for the reindex core
  3. We start the DataImportHandler import MySQL data in peices of 10.000 documents (to prevent long mysql locks)
  4. Afther the DataImportHandler is done (4 million records) we enable replication on the master
  5. The slaves will fetch the new reindex core from the master
  6. Once fetching is done by looking at the 'isReplicating' property we issue a core swap

Let's assume that the live core had 5000 documents and the reindex core 20000 after a complete rebuild.

The (wrong) result

After the swap this is the result:

Master: Live (20000), Reindex (5000) Slave: Live (20000), Reindex (20000)

It looks like the slave copies his reindex core to the live core but doesn't exchange the live core with the reindex core like the master does. The problem is I guess that the slave thinks it's newer then the master and the result is replication is dead on the reindex core.

Here are some log messages EXACTLY after a SWAP is issued:

13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Master's version: 1328526603592, generation: 332
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave's version: 1328526603596, generation: 336
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Starting replication process
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Master's version: 1328526603596, generation: 336
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Slave's version: 1328526603592, generation: 332
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Starting replication process
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Number of files in latest index in master: 1
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Number of files in latest index in master: 11
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/_ct.tvx
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/_ct.fdt
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/_ct.frq
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/segments_9c
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/_ct.fdx
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/_ct.prx
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/_ct.tii
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/_ct.tvd
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/_ct.tis
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller$FileFetcher fetchPackets
WARNING: No content recieved for file: {name=segments_98, lastmodified=0, size=0}
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/_ct.tvf
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller downloadIndexFiles
INFO: Skipping download for /usr/share/solr/data/index/_ct.fnm
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Total time taken for download : 0 secs
13-mrt-2012 14:29:46 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller fetchLatestIndex
INFO: Total time taken for download : 0 secs
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller modifyIndexProps
INFO: New index installed. Updating index properties...
13-mrt-2012 14:29:46 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: start commit(optimize=false,waitFlush=true,waitSearcher=true,expungeDeletes=false)
13-mrt-2012 14:29:46 org.apache.solr.search.SolrIndexSearcher <init>
INFO: Opening Searcher@598d00 main
13-mrt-2012 14:29:46 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@598d00 main from Searcher@189c12a main
    fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
13-mrt-2012 14:29:46 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@598d00 main
    fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
13-mrt-2012 14:29:46 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming Searcher@598d00 main from Searcher@189c12a main
    filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=34,warmupTime=1,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
13-mrt-2012 14:29:46 org.apache.solr.update.DirectUpdateHandler2 commit
INFO: end_commit_flush
13-mrt-2012 14:29:46 org.apache.solr.handler.ReplicationHandler doFetch
SEVERE: SnapPull failed 
org.apache.solr.common.SolrException: Index fetch failed : 
    at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:264)
    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
Caused by: java.lang.RuntimeException: java.io.IOException: read past EOF
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1069)
    at org.apache.solr.update.DirectUpdateHandler2.commit(DirectUpdateHandler2.java:418)
    at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:467)
    at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
    ... 10 more
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:36)
    at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:68)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:243)
    at org.apache.lucene.index.DirectoryReader$1.doBody(DirectoryReader.java:76)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
    at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:73)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:476)
    at org.apache.lucene.index.IndexReader.open(IndexReader.java:403)
    at org.apache.solr.core.StandardIndexReaderFactory.newReader(StandardIndexReaderFactory.java:38)
    at org.apache.solr.core.SolrCore.getSearcher(SolrCore.java:1058)
    ... 13 more
13-mrt-2012 14:29:46 org.apache.solr.search.SolrIndexSearcher warm
INFO: autowarming result for Searcher@598d00 main
    filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=34,warmupTime=10,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
13-mrt-2012 14:29:46 org.apache.solr.core.SolrCore registerSearcher
INFO: [live] Registered new searcher Searcher@598d00 main
13-mrt-2012 14:29:46 org.apache.solr.search.SolrIndexSearcher close
INFO: Closing Searcher@189c12a main
    fieldValueCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=0,warmupTime=0,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
    filterCache{lookups=0,hits=0,hitratio=0.00,inserts=0,evictions=0,size=34,warmupTime=1,cumulative_lookups=0,cumulative_hits=0,cumulative_hitratio=0.00,cumulative_inserts=0,cumulative_evictions=0}
13-mrt-2012 14:29:46 org.apache.solr.handler.SnapPuller doCommit
INFO: Force open index writer to make sure older index files get deleted
13-mrt-2012 14:29:46 org.apache.solr.handler.ReplicationHandler doFetch
SEVERE: SnapPull failed 
org.apache.solr.common.SolrException: Index fetch failed : 
    at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:329)
    at org.apache.solr.handler.ReplicationHandler.doFetch(ReplicationHandler.java:264)
    at org.apache.solr.handler.SnapPuller$1.run(SnapPuller.java:159)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRunAndReset(FutureTask.java:351)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:178)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:165)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:267)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:679)
Caused by: java.io.IOException: read past EOF
    at org.apache.lucene.store.BufferedIndexInput.refill(BufferedIndexInput.java:151)
    at org.apache.lucene.store.BufferedIndexInput.readByte(BufferedIndexInput.java:38)
    at org.apache.lucene.store.ChecksumIndexInput.readByte(ChecksumIndexInput.java:36)
    at org.apache.lucene.store.IndexInput.readInt(IndexInput.java:68)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:243)
    at org.apache.lucene.index.SegmentInfos$1.doBody(SegmentInfos.java:309)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:683)
    at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:517)
    at org.apache.lucene.index.SegmentInfos.read(SegmentInfos.java:306)
    at org.apache.lucene.index.IndexWriter.init(IndexWriter.java:1596)
    at org.apache.lucene.index.IndexWriter.<init>(IndexWriter.java:1418)
    at org.apache.solr.update.SolrIndexWriter.<init>(SolrIndexWriter.java:191)
    at org.apache.solr.update.UpdateHandler.createMainIndexWriter(UpdateHandler.java:98)
    at org.apache.solr.update.DirectUpdateHandler2.openWriter(DirectUpdateHandler2.java:173)
    at org.apache.solr.update.DirectUpdateHandler2.forceOpenWriter(DirectUpdateHandler2.java:376)
    at org.apache.solr.handler.SnapPuller.doCommit(SnapPuller.java:471)
    at org.apache.solr.handler.SnapPuller.fetchLatestIndex(SnapPuller.java:319)
    ... 10 more

Until step 6 everything is okay if I do everything manually. Everything collaps after the core swap.

I hope that someone could help me or telling me what I am doing wrong.. Should replication be disabled when you do a core swap? Any tips?


Solution

  • I managed to solve this issue, the problem was that both of my cores were using the same dataDir. By setting the dataDir in the solr.xml file to both different directories solved the problem!