hadoop amazon-s3 amazon-ec2 mapreduce elastic-map-reduce

Estimating computation costs for parallel computing


I am very new to the parallel computing world. My group uses Amazon EC2 and S3 to manage all of our data, and it has really opened up a new world to me.

My question is how to estimate the cost of a computation. Suppose I have n TB of data in k files on Amazon S3 (for example, I have 0.5 TB of data in 7000 zip files). I would like to loop through all the files and perform one regex-matching operation on each line of each file using Pig Latin.

I am very interested in estimating these costs:

  1. How many instances should I select to perform this task? What capacity should the instances have (the size of the master instance and of the map-reduce instances)? Can I deduce these capacities and costs from n and k, as well as from the cost of each operation?
  2. I have designed an example data flow: I used one xlarge instance as my master node and 10 medium instances as my map-reduce group. Would this be enough?
  3. How can I maximize the bandwidth each of these instances has for fetching data from S3? In my example data flow, the read speed from S3 looks to be about 250,000,000 bytes per minute. How much data exactly is transferred to each EC2 instance? Would this be the bottleneck of my job flow?

Solution

  • 1- IMHO, it depends entirely on your needs. Choose the number and size of instances based on how computationally intensive the job is. How far you can cut the cost depends on your dataset and on the amount of computation you are going to perform on that data.

    2- For how much data? What kind of operations? What latency and throughput do you need? For POCs and small projects it seems good enough.

    3- It depends on several things, such as whether your instances are in the same region as your S3 endpoint and which particular S3 node you're hitting at a given point in time. You might be better off using EBS if you need quicker data access, IMHO: you could attach an EBS volume to your EC2 instance and keep the data you need frequently there. Otherwise, some straightforward options are using 10 Gigabit connections between servers or perhaps using dedicated (costly) instances. But nobody can guarantee whether data transfer will be a bottleneck or not. Sometimes it may be.

    I don't know if this answers your cost questions completely, but Amazon's Monthly Calculator certainly would.
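
To put rough numbers on points 1 and 3, here is a minimal back-of-the-envelope sketch in Python. It uses the figures from the question (0.5 TB of data, about 250,000,000 bytes per minute read from S3, 10 workers plus one master). Two things in it are assumptions rather than facts: that the measured read rate is per worker and scales linearly with the number of workers, and the hourly prices, which are made-up placeholders and not real EC2 rates. The function name `estimate_job` is only illustrative. The sketch models S3 transfer time only; decompression and regex CPU time are ignored, so treat the result as a lower bound.

```python
def estimate_job(total_bytes, bytes_per_minute_per_worker, worker_count,
                 worker_hourly_usd, master_hourly_usd):
    """Estimate transfer-bound run time and cost for a cluster.

    Assumes read throughput scales linearly with the number of workers
    (an assumption; real S3 throughput may not scale this way).
    """
    aggregate_rate = bytes_per_minute_per_worker * worker_count
    minutes = total_bytes / aggregate_rate
    hours = minutes / 60
    # Cost of running all workers plus the master for the whole job.
    cluster_hourly_usd = worker_count * worker_hourly_usd + master_hourly_usd
    cost_usd = hours * cluster_hourly_usd
    return minutes, hours, cost_usd


# Figures from the question; prices are hypothetical placeholders.
minutes, hours, cost = estimate_job(
    total_bytes=500_000_000_000,              # 0.5 TB (decimal)
    bytes_per_minute_per_worker=250_000_000,  # measured S3 read rate
    worker_count=10,
    worker_hourly_usd=0.10,                   # placeholder, not a real price
    master_hourly_usd=0.40,                   # placeholder, not a real price
)
print(f"{minutes:.0f} min, {hours:.2f} h, ${cost:.2f}")
```

If the 250,000,000 bytes per minute is instead the total for the whole cluster, pass `worker_count=1` for the rate calculation and the transfer time becomes ten times longer (2000 minutes), which would make S3 reads the dominant cost. Measuring the rate on one instance and then on several is the quickest way to find out which case applies.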