amazon-web-services elasticsearch aws-elasticsearch elasticsearch-snapshot

AWS Elasticsearch Service: Automated snapshot has been running for more than 20 days

We have been seeing a lot of Elasticsearch query failures for the past few days. When I monitor the cluster health, CPU and JVM memory utilization are high (almost 98%). While debugging the issue, I found that the last automated snapshot has been in the IN_PROGRESS state for more than 20 days, and I suspect this is the root cause. However, I'm not sure what is causing the snapshot to run so long, and I have not been able to stop or delete it. When I sent an HTTP DELETE request against the repository using Postman with an AWS signature, I got a 401 Unauthorized error with the message "Your request is not allowed."

Can anyone help me understand the long-running snapshot issue and how to resolve it?

Thanks in advance.

Solution

This is a classic case of a stuck snapshot in Elasticsearch. A snapshot gets stuck when the master node and a data node go out of sync on a shard's snapshot state. This usually happens when the cluster turns red or a node suddenly drops out of the cluster under high JVM pressure.
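To confirm that a snapshot really is stuck rather than just slow, you can list the snapshots in the repository (GET _snapshot/<repository>/_all) and look for entries that have been IN_PROGRESS for an unreasonably long time. The following is only a rough sketch: the helper name find_stuck_snapshots, the one-day cutoff and the sample response are my own assumptions for illustration, and the function just inspects an already-fetched JSON body instead of calling the cluster.

```python
from datetime import timedelta


def find_stuck_snapshots(response, now_ms, max_age=timedelta(days=1)):
    """Return (name, age) pairs for IN_PROGRESS snapshots older than max_age.

    `response` is the parsed JSON body of GET _snapshot/<repository>/_all.
    `now_ms` is the current time in epoch milliseconds.
    The one-day default cutoff is an arbitrary assumption; tune it to your cluster.
    """
    limit_ms = max_age.total_seconds() * 1000
    stuck = []
    for snap in response.get("snapshots", []):
        if snap.get("state") != "IN_PROGRESS":
            continue
        age_ms = now_ms - snap["start_time_in_millis"]
        if age_ms > limit_ms:
            stuck.append((snap["snapshot"], timedelta(milliseconds=age_ms)))
    return stuck


# Hypothetical sample data, shaped like the snapshot listing response.
START_MS = 1_560_000_000_000
sample_response = {
    "snapshots": [
        {"snapshot": "snap-old", "state": "SUCCESS", "start_time_in_millis": START_MS},
        {"snapshot": "snap-stuck", "state": "IN_PROGRESS", "start_time_in_millis": START_MS},
    ]
}
now_ms = START_MS + 20 * 24 * 60 * 60 * 1000  # 20 days after the snapshot started

for name, age in find_stuck_snapshots(sample_response, now_ms):
    print(f"{name} has been IN_PROGRESS for {age.days} days")
```

Reading the snapshot list only needs a signed GET request, so this kind of check should work even when deleting the snapshot is not permitted.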

High CPU/JVM memory utilization is usually not caused by a stuck snapshot. It is mostly the other way around, i.e. the snapshot gets stuck in the IN_PROGRESS state because of high JVM utilization. For better cluster performance, you should try to keep JVM memory pressure below 80%. Scaling up is one option to reduce JVM pressure.
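To see which nodes are above that 80% mark, you can look at heap_used_percent in the output of GET _nodes/stats/jvm. Below is a minimal sketch that works on the parsed JSON of that response; the function name nodes_over_heap_threshold and the sample node data are made up for the example.

```python
def nodes_over_heap_threshold(stats, threshold_pct=80):
    """Return (node_name, heap_used_percent) for nodes above threshold_pct.

    `stats` is the parsed JSON body of GET _nodes/stats/jvm.
    Results are sorted with the most heavily loaded node first.
    """
    hot = []
    for node in stats.get("nodes", {}).values():
        pct = node["jvm"]["mem"]["heap_used_percent"]
        if pct > threshold_pct:
            hot.append((node.get("name", "unknown"), pct))
    return sorted(hot, key=lambda item: item[1], reverse=True)


# Hypothetical sample data, shaped like the node stats response.
sample_stats = {
    "nodes": {
        "id-1": {"name": "node-a", "jvm": {"mem": {"heap_used_percent": 98}}},
        "id-2": {"name": "node-b", "jvm": {"mem": {"heap_used_percent": 61}}},
        "id-3": {"name": "node-c", "jvm": {"mem": {"heap_used_percent": 85}}},
    }
}

for name, pct in nodes_over_heap_threshold(sample_stats):
    print(f"{name}: heap at {pct}%")
```

If most nodes show up in this list for long stretches, that points to the cluster being undersized for its workload, which is where scaling up helps.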

Users are not allowed to delete or otherwise manage automated snapshots on AWS Elasticsearch, which is why your DELETE request was rejected. To fix a snapshot stuck in the IN_PROGRESS state, you should engage AWS Elasticsearch customer support.