Tags: elasticsearch, amazon-s3, logstash

Export 50 GB Elasticsearch indices to S3 as JSON/text


I need to export a large number of Elasticsearch indices to S3 in JSON format, where each index is around 50 GB in size. I've been looking into a number of ways of doing this, but I need the most time-efficient method due to the size of the data.

I tried elasticdump, but from testing this out, I think it stores the whole index in memory before dumping it as a file to S3. So I'd need an EC2 instance with memory in excess of 50 GB. Is there any way of getting it to dump a series of smaller files instead of one huge file?

There are other options, such as using Logstash, or Python's Elasticsearch library with its helpers, to perform the operation.

What would be the best method for this?


Solution

  • ES to S3 Logstash Pipeline

    To move raw JSON from Elasticsearch to an S3 bucket, you can use the s3 output plugin in a Logstash pipeline. Here is an example pipeline to follow:

    input {
      elasticsearch {
        hosts => ["localhost:9200"]
        index => "myindex-*"
        query => '{ "query": { "match_all": {} } }'
      }
    }
    
    filter {
      # Your filter configuration here
    }
    
    output {
      s3 {
        bucket => "BUCKET_NAME"
        region => "us-east-1"
        access_key_id => "ACCESS_KEY"
        secret_access_key => "SECRET_KEY"
        canned_acl => "private"
        prefix => "logs/" # Optional 
        time_file => 5        # the s3 output's time_file is measured in minutes
        codec => json_lines   # write each event as one JSON document per line
      }
    }
    
    

    S3 output plugin - parameters

    • bucket: The name of the S3 bucket to save the data to.
    • region: The AWS region that the S3 bucket is located in.
    • access_key_id: The AWS access key ID with permission to write to the S3 bucket.
    • secret_access_key: The AWS secret access key associated with the access key ID.
    • prefix: A prefix to be added to the object key of the saved data.
    • time_file: How long to buffer data before flushing a new object to S3. The s3 output plugin measures this in minutes, so time_file => 5 writes a new file roughly every 5 minutes.
    • codec: The codec used to encode the data to be saved. In this example, json_lines writes each event as one JSON document per line (newline-delimited JSON), which matches the requested export format.

    If you are running this pipeline on an ECS container or an EC2 instance, you don't need to provide ACCESS_KEY and SECRET_KEY; for security reasons, create an IAM role with write access to the bucket and attach it to the ECS task or EC2 instance instead.
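
  • Python alternative (elasticsearch helpers + boto3)

    Since the question also mentions Python's Elasticsearch library, here is a minimal sketch of that route (not part of the Logstash answer above): helpers.scan pages through the index with the scroll API so only one batch of documents is in memory at a time, and boto3 uploads newline-delimited JSON chunks as separate S3 objects. The host, index name, bucket, and chunk sizes below are placeholders, not values from the question.

    import json

    import boto3
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import scan

    ES_HOST = "http://localhost:9200"   # assumed Elasticsearch endpoint
    INDEX = "myindex-2023.01"           # placeholder index name
    BUCKET = "BUCKET_NAME"              # placeholder bucket name
    DOCS_PER_FILE = 100_000             # documents per S3 object (placeholder)

    es = Elasticsearch(ES_HOST)
    s3 = boto3.client("s3")

    def flush(lines, part):
        # Upload one chunk as a newline-delimited JSON object.
        key = f"exports/{INDEX}/part-{part:05d}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body="\n".join(lines).encode("utf-8"))

    buffer, part = [], 0
    # scan() streams hits via the scroll API in batches of `size` documents,
    # so memory use stays bounded regardless of index size.
    for hit in scan(es, index=INDEX, query={"query": {"match_all": {}}}, size=5000):
        buffer.append(json.dumps(hit["_source"]))
        if len(buffer) >= DOCS_PER_FILE:
            flush(buffer, part)
            buffer, part = [], part + 1

    if buffer:
        flush(buffer, part)  # flush the final partial chunk

    Because each chunk is written as its own object, you end up with a series of smaller files rather than one 50 GB dump, which addresses the concern in the question about a single huge file being buffered in memory.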