I need to export a large number of Elasticsearch indices to S3 in JSON format, where each index is around 50GB in size. I've been looking into a number of ways of doing this, but given the size of the data I need the most time-efficient method.
I tried elasticdump, but from testing it out, I think it stores the whole index in memory before dumping it as a single file to S3, so I'd need an EC2 instance with memory in excess of 50GB. Is there any way of getting it to dump a series of smaller files instead of one huge file?
There are other options, like using Logstash, or Python's Elasticsearch library with its helpers, to do the operation.
What would be the best method for this?
To move raw JSON from Elasticsearch to an S3 bucket, you can use the s3 output plugin in a Logstash pipeline. Here is an example pipeline to follow:
input {
  elasticsearch {
    hosts => ["localhost:9200"]
    index => "myindex-*"
    query => '{ "query": { "match_all": {} } }'
  }
}

filter {
  # Your filter configuration here (it can be left empty)
}

output {
  s3 {
    bucket => "BUCKET_NAME"
    region => "us-east-1"
    access_key_id => "ACCESS_KEY"
    secret_access_key => "SECRET_KEY"
    canned_acl => "private"
    prefix => "logs/"       # optional key prefix inside the bucket
    time_file => 5          # rotate to a new file every 5 minutes
    codec => json_lines     # one JSON document per line
  }
}
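If you save the pipeline above as, say, es-to-s3.conf (the file name is just an example), you can run it from the Logstash installation directory with:

bin/logstash -f es-to-s3.conf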
S3 output plugin - parameters
time_file is specified in minutes, so time_file => 5 closes the current file and uploads it to the bucket every 5 minutes; this is what gives you a series of smaller files instead of one huge one (the plugin also has a size_file setting if you would rather rotate by file size). The json_lines codec encodes each document as one JSON object per line; note that an output accepts only a single codec, so use either json_lines or plain with a format string, not both. If you are running this pipeline in an ECS container or on EC2, you don't need to provide the ACCESS_KEY and SECRET_KEY; for security reasons, create an IAM role with access to the bucket and assign it to the ECS task or EC2 instance instead.
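Since the question also mentions Python's Elasticsearch library, here is a minimal sketch of that route, using the scan helper (which streams results via the scroll API, so memory use stays flat) together with boto3 to upload fixed-size JSON Lines chunks. The bucket, prefix, index name, and chunk size below are assumptions, not values from the question:

import io
import json

import boto3
from elasticsearch import Elasticsearch
from elasticsearch.helpers import scan

es = Elasticsearch("http://localhost:9200")
s3 = boto3.client("s3")

BUCKET = "BUCKET_NAME"   # assumption: same placeholder bucket as above
PREFIX = "logs/"
INDEX = "myindex-*"
CHUNK_DOCS = 500_000     # assumption: tune so each part lands at a manageable size

def upload_part(buf, part):
    # Each chunk becomes its own S3 object in JSON Lines format.
    key = f"{PREFIX}export-part-{part:05d}.jsonl"
    s3.put_object(Bucket=BUCKET, Key=key, Body=buf.getvalue().encode("utf-8"))

buf, part, count = io.StringIO(), 0, 0
# scan() pages through the index with the scroll API; size is docs per page.
for hit in scan(es, index=INDEX, query={"query": {"match_all": {}}}, size=5000):
    buf.write(json.dumps(hit["_source"]) + "\n")
    count += 1
    if count >= CHUNK_DOCS:
        upload_part(buf, part)
        part += 1
        buf, count = io.StringIO(), 0

if count:  # flush the final partial chunk
    upload_part(buf, part)

This never holds more than one chunk in memory, so it runs fine on a small instance regardless of index size.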