One of my ES nodes has failed because of java.lang.OutOfMemoryError: Java heap space
error. Here is the full stack trace from the logs:
[2020-09-18T04:25:04,215][WARN ][o.e.a.b.TransportShardBulkAction] [search1] [[my_index_4][0]] failed to perform indices:data/write/bulk[s] on replica [my_index_4][0], node[cm_76wfGRFm9nbPR1mJxTQ], [R], s[STARTED], a[id=BUpviwHxQK2qC3GrELC2Hw]
org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected
[2020-09-18T04:25:04,215][WARN ][o.e.c.a.s.ShardStateAction] [search1] [my_index_4][0] received shard failed for shard id [[my_index_4][0]], allocation id [BUpviwHxQK2qC3GrELC2Hw], primary term [2], message [failed to perform indices:data/write/bulk[s] on replica [my_index_4][0], node[cm_76wfGRFm9nbPR1mJxTQ], [R], s[STARTED], a[id=BUpviwHxQK2qC3GrELC2Hw]], failure [NodeDisconnectedException[[search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected]]
org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][indices:data/write/bulk[s][r]] disconnected
[2020-09-18T04:25:04,215][DEBUG][o.e.a.a.c.n.i.TransportNodesInfoAction] [search1] failed to execute on node [cm_76wfGRFm9nbPR1mJxTQ]
org.elasticsearch.transport.NodeDisconnectedException: [search3][X.X.X.179:9300][cluster:monitor/nodes/info[n]] disconnected
[2020-09-18T04:25:04,219][INFO ][o.e.c.r.a.AllocationService] [search1] Cluster health status changed from [GREEN] to [YELLOW] (reason: [shards failed [[my_index_4][0]] ...]).
[2020-09-18T04:25:05,450][INFO ][o.e.m.j.JvmGcMonitorService] [search1] [gc][11099506] overhead, spent [605ms] collecting in the last [1.4s]
[2020-09-18T04:25:05,453][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [search1] fatal error in thread [elasticsearch[search1][search][T#5]], exiting
java.lang.OutOfMemoryError: Java heap space
at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSource$GlobalOrdinalValuesSource.<init>(CompositeValuesSource.java:137) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesSource.wrapGlobalOrdinals(CompositeValuesSource.java:123) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeValuesComparator.<init>(CompositeValuesComparator.java:50) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregator.<init>(CompositeAggregator.java:69) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.bucket.composite.CompositeAggregationFactory.createInternal(CompositeAggregationFactory.java:52) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.AggregatorFactory.create(AggregatorFactory.java:216) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.AggregatorFactories.createTopLevelAggregators(AggregatorFactories.java:216) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.aggregations.AggregationPhase.preProcess(AggregationPhase.java:55) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.query.QueryPhase.execute(QueryPhase.java:105) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService.lambda$loadIntoContext$14(IndicesService.java:1133) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService$$Lambda$2241/341562582.accept(Unknown Source) ~[?:?]
at org.elasticsearch.indices.IndicesService.lambda$cacheShardLevelResult$15(IndicesService.java:1186) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService$$Lambda$2242/1286052129.get(Unknown Source) ~[?:?]
at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:160) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesRequestCache$Loader.load(IndicesRequestCache.java:143) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.cache.Cache.computeIfAbsent(Cache.java:412) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesRequestCache.getOrCompute(IndicesRequestCache.java:116) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService.cacheShardLevelResult(IndicesService.java:1192) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.indices.IndicesService.loadIntoContext(IndicesService.java:1132) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService.loadOrExecuteQueryPhase(SearchService.java:305) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService.executeQueryPhase(SearchService.java:340) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:316) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService$2.onResponse(SearchService.java:312) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.search.SearchService$3.doRun(SearchService.java:1002) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.2.4.jar:6.2.4]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.4.jar:6.2.4]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_171]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_171]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
Because of the exception above, I am getting master_not_discovered_exception
when I am hitting any of ES APIs.
Question: Can anyone tell me the next steps that I should perform to put Elasticsearch back to normal state? Is there a way to restart disconnected node?
First let me briefly explains what might have caused this issue:
Now coming to resolution part
This ES node is dead, which is causing master_not_discovered_exception
hence its important to bring restart this node again and see if this exception goes.
Prevention of OOM exception