I receive very large (5TB) .csv files from my clients on S3 buckets. I have to process these files, add columns to them and store them back.
I might need to work with the files in the same way as I increase the number of features for future improved models.
Clearly because S3 stores data as objects, every time I make a change, I have to read and write 5TB of data.
What is the best approach I can take to process these data cost effectively and promptly:
Thank you
First, a warning -- the maximum size of an object in Amazon S3 is 5TB. If you are going to add information that results in a larger object, then you will likely hit that limit.
The smarter way of processing this amount of data is to do it in parallel and preferably in multiple, smaller files rather than a single 5TB file.
Amazon EMR (effectively, a managed Hadoop environment) is excellent for performing distributed operations across large data sets. It can process data from many files in parallel and can compress/decompress data on-the-fly. It's complex to learn, but very efficient and capable.
If you are sticking with your current method of processing the data, I would recommend: