Tags: amazon-ec2, apache-spark, emr

Spark - Which instance type is preferred for AWS EMR cluster?


I am running some machine learning algorithms on an EMR Spark cluster. Which kind of instance should I use to get the best cost/performance ratio?

For the same level of prices, I can choose among:

          vCPU  ECU  Memory(GiB)
m3.xlarge  4     13     15     
c4.xlarge  4     16      7.5
r3.xlarge  4     13     30.5

Which kind of instance should be used in EMR Spark cluster?


Solution

  • Generally speaking, it depends on your use case and needs, but I can suggest a minimum configuration based on the information you have shared.

    You seem to be trying to train an ALS factorization or SVD on matrices of roughly 2 to 4 GB. That is actually not much data.

    You'll need at least 1 master and 2 nodes to set up and configure a small distributed cluster. The master won't be doing any of the computing, so it won't need many resources, but it will of course be handling task scheduling, etc.

    You can add slaves (instances) according to your needs.

    • 1 x master : m5.xlarge (previously m3.xlarge) - vCPU : 4, RAM : 16 GB with EBS storage.
    • 2 x slaves : c5.4xlarge (previously c3.4xlarge) - vCPU : 16, RAM : 32 GB with EBS storage.
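
    As a rough sketch, the layout above can be expressed as the keyword arguments that boto3's EMR `run_job_flow` call accepts. The cluster name, the release label, the worker instance type you pass in, and the assumption that the default EMR roles exist in your account are all placeholders for illustration, not part of the original recommendation.

```python
def build_cluster_request(release_label, worker_type, worker_count=2):
    """Return keyword arguments for boto3's emr.run_job_flow().

    One m5.xlarge master plus `worker_count` workers of `worker_type`.
    Name, release label and roles are placeholders (assumptions).
    """
    return {
        "Name": "spark-ml-cluster",            # hypothetical cluster name
        "ReleaseLabel": release_label,          # the EMR release you target
        "Applications": [{"Name": "Spark"}],
        "Instances": {
            "InstanceGroups": [
                {"Name": "master", "InstanceRole": "MASTER",
                 "InstanceType": "m5.xlarge", "InstanceCount": 1},
                {"Name": "workers", "InstanceRole": "CORE",
                 "InstanceType": worker_type, "InstanceCount": worker_count},
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        # Assumes the default EMR roles have been created in the account.
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


def launch(request, region="us-east-1"):
    """Submit the request; needs boto3 and AWS credentials (not run here)."""
    import boto3  # imported lazily so the builder works without boto3
    emr = boto3.client("emr", region_name=region)
    return emr.run_job_flow(**request)["JobFlowId"]
```

    Keeping the request as a plain dictionary makes it easy to review or change the worker count before anything is actually launched.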

    EDIT : As mentioned in the comments, 5th generation instances are now available for each of the instance types mentioned in this thread: R5, M5, and C5. In general, latest-generation instance types are cheaper and more performant than their older counterparts.

    C3, C4, and C5 are compute-optimized instances with high-performance processors and the lowest price per unit of compute performance in EC2. The memory-optimized R3, R4, and R5 instances are the ones whose recommended use cases are distributed memory caches and in-memory analytics, but C5 will do the job for you at a lower price.
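
    To make the compute-versus-memory trade-off concrete, here is a small sketch that derives per-vCPU ratios from the three instance types listed in the question. The figures are taken from the question's table only; the helper names are made up for illustration.

```python
# (vCPU, ECU, memory in GiB) exactly as listed in the question's table
INSTANCES = {
    "m3.xlarge": (4, 13, 15.0),
    "c4.xlarge": (4, 16, 7.5),
    "r3.xlarge": (4, 13, 30.5),
}


def per_vcpu(instances):
    """Return ECU and GiB of memory per vCPU for each instance type."""
    return {
        name: {"ecu_per_vcpu": ecu / vcpu, "gib_per_vcpu": mem / vcpu}
        for name, (vcpu, ecu, mem) in instances.items()
    }


def rank_by(ratios, key):
    """Instance names sorted from highest to lowest value of `key`."""
    return sorted(ratios, key=lambda name: ratios[name][key], reverse=True)


ratios = per_vcpu(INSTANCES)
print(rank_by(ratios, "ecu_per_vcpu"))   # compute-heavy types first
print(rank_by(ratios, "gib_per_vcpu"))   # memory-heavy types first
```

    At the same price level, the C family leads on compute per vCPU and the R family on memory per vCPU, which is the trade-off to weigh against the size of your dataset.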

    Performance Optimizations :

    • Amazon EMR charged in hourly increments when the referenced guide was written (AWS has since moved EMR to per-second billing with a one-minute minimum). Under hourly billing, once you start a cluster you pay for the entire hour. That's important to remember: if you are paying for a full hour of an Amazon EMR cluster anyway, improving your data processing time by a matter of minutes may not be worth your time and effort.

    • Don't forget that adding more nodes to increase performance is cheaper than spending time optimizing your cluster.
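
    The two points above come down to simple arithmetic. The sketch below assumes whole-hour billing as described in the first point, and uses a made-up rate of 1.0 per node-hour and made-up runtimes; none of these numbers are real prices or measurements.

```python
import math


def billed_cost(runtime_minutes, node_count, hourly_rate_per_node):
    """Cost of a run when every started hour is billed in full."""
    billed_hours = math.ceil(runtime_minutes / 60)
    return billed_hours * node_count * hourly_rate_per_node


RATE = 1.0  # hypothetical price per node-hour

# Shaving 10 minutes off a 55-minute job changes nothing on the bill.
print(billed_cost(55, 3, RATE), billed_cost(45, 3, RATE))

# A 70-minute job on 3 nodes is billed for 2 hours; if a 4th node brings it
# under an hour (an assumed speed-up), the larger cluster is cheaper.
print(billed_cost(70, 3, RATE), billed_cost(50, 4, RATE))
```

    Under per-second billing the rounding effect disappears, so the comparison only applies to the hourly model.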

    Reference : Amazon EMR Best Practices - Parviz Deyhim.

    EDIT : You might also consider enabling Ganglia to monitor your cluster resources: CPU, RAM, and network I/O. This would also help you tune your EMR cluster. In practice, there is no configuration to do: just follow the documentation to add it to your EMR cluster at creation time.
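
    If you create the cluster through boto3, applications are requested via the `Applications` list passed to `run_job_flow`. The helper below is a minimal sketch that adds Ganglia to such a list; it assumes the EMR release you pick offers an application named "Ganglia", and the helper name is hypothetical.

```python
def with_ganglia(applications):
    """Return a copy of an EMR Applications list that includes Ganglia."""
    names = {app["Name"] for app in applications}
    if "Ganglia" in names:
        return list(applications)
    # Assumes the chosen EMR release ships an application called "Ganglia".
    return list(applications) + [{"Name": "Ganglia"}]


apps = with_ganglia([{"Name": "Spark"}])
print(apps)  # pass this as Applications=apps when creating the cluster
```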