amazon-web-services amazon-s3 amazon-ec2 amazon-ebs

How to use S3 and EBS in tandem for cost effective analytics on AWS?

I receive very large (5TB) .csv files from my clients on S3 buckets. I have to process these files, add columns to them and store them back.

I might need to work with the files in the same way as I increase the number of features for future improved models.

Clearly because S3 stores data as objects, every time I make a change, I have to read and write 5TB of data.

What is the best approach I can take to process these data cost effectively and promptly:

Store a 5TB file on S3 as object, every time read the object, do the processing and save the result back to S3
Store the 5TB on S3 as object, read the object, chunk it to smaller objects and save them back to S3 as multiple objects so in future just work with the chunks I am interested in
Save every thing on EBS from start, mount it to the EC2 and do the processing

Thank you

Solution

First, a warning -- the maximum size of an object in Amazon S3 is 5TB. If you are going to add information that results in a larger object, then you will likely hit that limit.

The smarter way of processing this amount of data is to do it in parallel and preferably in multiple, smaller files rather than a single 5TB file.

Amazon EMR (effectively, a managed Hadoop environment) is excellent for performing distributed operations across large data sets. It can process data from many files in parallel and can compress/decompress data on-the-fly. It's complex to learn, but very efficient and capable.

If you are sticking with your current method of processing the data, I would recommend:

If your application can read directly from S3, use that as the source. Otherwise, copy the file(s) to EBS.
Process the data
Store the output locally in EBS, preferably in smaller files (GBs rather than TBs)
Copy the files to S3 (or keep them on EBS if that meets your needs)