I am very new to the parallel computing world. My group uses Amazon EC2 and S3 to manage all of our data, and it has really opened a new world for me.
My question is how to estimate the cost of a computation. Suppose I have n TB of data split across k files on Amazon S3 (for example, 0.5 TB of data in 7,000 zip files). I would like to loop through all the files and perform one regex-matching operation, written in Pig Latin, on each line of each file, roughly like the sketch below.
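For concreteness, such a job might look like the following minimal Pig Latin sketch. The bucket paths and the pattern are hypothetical placeholders, and it assumes the archives have already been unpacked to plain text (Hadoop/Pig does not read .zip files out of the box, though it handles gzip natively):

```
-- Sketch: match a regex against every line of every input file.
-- Paths and pattern are placeholders, not real values.
lines   = LOAD 's3://my-bucket/input/*' USING TextLoader() AS (line:chararray);
-- MATCHES uses Java regex semantics and must match the whole line,
-- hence the leading/trailing '.*'.
matched = FILTER lines BY line MATCHES '.*some-pattern.*';
STORE matched INTO 's3://my-bucket/output/';
```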
I am very interested in estimating these costs:
1. Which type of EC2 instance should I choose, and what will the computation cost?
2. Is S3 a good fit for storing and serving this kind of data?
3. How fast (and how costly) is the data transfer between S3 and EC2?
Taking these in order (a back-of-envelope example follows this list):

1- IMHO, it depends solely on your needs. You should choose the instance type based on how compute-intensive your workload is. You can obviously cut costs by matching the instance to your dataset size and the amount of computation you will run over it.
2- For how much data? What kinds of operations? What latency/throughput do you need? For POCs and small projects, S3 seems good enough.
3- It actually depends on several things, such as whether you're in the same region as your S3 endpoint, which particular S3 node you're hitting at a given point in time, etc. If you need quicker data access, you might be better off using EBS, IMHO: you can mount an EBS volume to your EC2 instance and keep the data you access frequently on it. Otherwise, some straightforward options are 10 Gigabit connections between servers, or dedicated (costly) instances. But nobody can guarantee whether data transfer will be a bottleneck; sometimes it may be.
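To put rough numbers on all three points (every rate below is a hypothetical placeholder; check the current AWS pricing pages for real figures):

- Compute: a 10-node cluster billed at, say, $0.10 per instance-hour that finishes the scan in 2 hours costs about 10 × 2 × $0.10 = $2.
- Storage: 0.5 TB on S3 at, say, $0.03/GB-month is roughly 500 × $0.03 ≈ $15/month, plus a small per-request charge for reading the 7,000 files.
- Transfer: pulling 0.5 TB at a sustained 50 MB/s takes about 500,000 / 50 = 10,000 seconds, i.e. just under 3 hours. Transfer between S3 and EC2 in the same region is typically free of charge, which is one more reason to co-locate them.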
I don't know if this answers your cost queries completely, but their Monthly Calculator certainly would.