elasticsearch, google-cloud-platform, google-cloud-storage, elasticsearch-curator

Archive old data from Elasticsearch to Google Cloud Storage


I have an Elasticsearch server installed on a Google Compute Engine instance. A huge amount of data is ingested every minute, and the underlying disk fills up pretty quickly.

I understand we can increase the size of the disks, but storing long-term data this way would cost a lot.

We need 90 days of data in the Elasticsearch server (Compute Engine disk), and data older than 90 days (up to 7 years) should be stored in Google Cloud Storage buckets. The older data should be retrievable in case it is needed for later analysis.

One way I know of is to take snapshots frequently and delete indices older than 90 days from the Elasticsearch server using Curator. This way I can keep the disks free and minimize storage costs.

Is there any other way this can be done, without having to automate the above-mentioned idea myself?

For example, something provided by Elasticsearch out of the box that archives data older than 90 days itself and keeps the data files on the disk; we could then manually move those files from the disk to Google Cloud Storage.


Solution

  • There is no other way around it: to back up your data you need to use the snapshot/restore API; it is the only safe and reliable option available.

    There is a plugin, repository-gcs, that lets you use Google Cloud Storage as a snapshot repository (a sketch of registering such a repository follows at the end of this answer).

    If you are using version 7.5+ and Kibana with the basic license, you can configure snapshots directly from the Kibana interface. If you are on an older version or do not have Kibana, you will need to rely on Curator or a custom script run on a crontab schedule (see the sketch below).

    While you can copy the data directory, you would need to stop your entire cluster every time you want to copy the data, and to restore it you would also need to create a new cluster from scratch each time. This is a lot of work and not practical when you have something like the snapshot/restore API.
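
As an illustration only (not part of the original answer), here is a minimal sketch of registering a GCS-backed snapshot repository through the REST API. It assumes the repository-gcs plugin is already installed on every node, that Elasticsearch is reachable at localhost:9200, and that the instance's service account can write to the bucket; the repository and bucket names are placeholders.

```python
import requests

ES_HOST = "http://localhost:9200"  # assumption: Elasticsearch reachable locally

# Register a snapshot repository backed by a GCS bucket.
# Requires the repository-gcs plugin and a service account (or the
# Compute Engine default credentials) with write access to the bucket.
repo_settings = {
    "type": "gcs",
    "settings": {
        "bucket": "my-es-archive-bucket",  # hypothetical bucket name
        "client": "default",
    },
}

resp = requests.put(f"{ES_HOST}/_snapshot/gcs_archive", json=repo_settings)
resp.raise_for_status()
print(resp.json())
```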
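
And here is a hedged sketch of the kind of "custom script on a crontab schedule" mentioned above: snapshot indices older than 90 days into that repository, then delete them from the local disk. It assumes a daily index naming scheme such as logs-2020.01.31 and the hypothetical repository name gcs_archive; adjust the pattern and retention to match your setup.

```python
"""Cron-driven sketch: archive old indices to GCS, then free the disk."""
from datetime import datetime, timedelta

import requests

ES_HOST = "http://localhost:9200"
REPO = "gcs_archive"
PATTERN = "logs-%Y.%m.%d"   # hypothetical daily index naming scheme
RETENTION_DAYS = 90


def indices_older_than(days):
    """Return index names whose date suffix is older than the cutoff."""
    cutoff = datetime.utcnow() - timedelta(days=days)
    rows = requests.get(f"{ES_HOST}/_cat/indices?h=index&format=json").json()
    old = []
    for row in rows:
        try:
            if datetime.strptime(row["index"], PATTERN) < cutoff:
                old.append(row["index"])
        except ValueError:
            pass  # index name does not match the daily pattern
    return old


def archive_and_delete():
    old = indices_older_than(RETENTION_DAYS)
    if not old:
        return
    snapshot = "archive-" + datetime.utcnow().strftime("%Y%m%d%H%M%S")
    body = {"indices": ",".join(old), "include_global_state": False}
    # wait_for_completion blocks until the snapshot is fully written to GCS,
    # so indices are only deleted once they are safely stored in the bucket.
    resp = requests.put(
        f"{ES_HOST}/_snapshot/{REPO}/{snapshot}?wait_for_completion=true",
        json=body,
    )
    resp.raise_for_status()
    for index in old:
        requests.delete(f"{ES_HOST}/{index}").raise_for_status()


if __name__ == "__main__":
    archive_and_delete()
```

Restoring any archived index later is then a single call to the same API, e.g. POST _snapshot/gcs_archive/<snapshot>/_restore.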