
Best way to deploy Spark?


Are there substantial advantages to deploying Spark on top of YARN or EMR, instead of directly on EC2? This would be primarily for research and prototyping, probably using Scala. Our reluctance to use anything other than EC2 stems mainly from the extra infrastructure and complexity the other options involve, but perhaps they provide substantial benefits as well?

We'd mostly be reading/writing data from/to S3.


Solution

Let us distinguish the different layers. First there is the infrastructure layer, i.e. on which (virtual) machines the Spark job should run. Potential options include a local cluster of machines or a cluster of virtual machines rented from EC2. Especially when reading/writing a lot of data from/to S3, EC2 can be a good option: the two services are well integrated and usually run in the same data centers, giving you better network performance.
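
For the S3 part, here is a minimal sketch of what such a job could look like in Scala. It assumes Spark 2.x or later with the hadoop-aws S3A connector on the classpath (EMR ships this preconfigured, with credentials taken from the instance's IAM role); the bucket, paths, and column name are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object S3RoundTrip {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("s3-round-trip")
      .getOrCreate()

    // Read CSV input from S3 ("my-bucket" and the paths are placeholders).
    val events = spark.read
      .option("header", "true")
      .csv("s3a://my-bucket/input/events/")

    // A trivial transformation standing in for real work.
    val cleaned = events.filter(events("user_id").isNotNull)

    // Write the result back to S3 as Parquet.
    cleaned.write.mode("overwrite").parquet("s3a://my-bucket/output/events/")

    spark.stop()
  }
}
```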

The second layer is the software/scheduling layer on top, i.e. which piece of software connects all these machines in order to schedule and run your Spark job. Options here include YARN (the scheduler from the Hadoop project), Mesos (a general-purpose scheduler that can also handle non-Hadoop workloads), and Myriad (essentially YARN on Mesos).
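
To make that concrete: the scheduling layer is what the Spark master URL selects. A small sketch (the Mesos ZooKeeper address and standalone host are placeholders; in practice the master is usually passed via spark-submit --master rather than hard-coded):

```scala
import org.apache.spark.sql.SparkSession

// The cluster manager is chosen by the master URL:
//   "yarn"                         -- Hadoop/YARN (needs HADOOP_CONF_DIR/YARN_CONF_DIR set)
//   "mesos://zk://zk1:2181/mesos"  -- Mesos via ZooKeeper (placeholder address)
//   "spark://master-host:7077"     -- Spark's own standalone scheduler (placeholder host)
val spark = SparkSession.builder()
  .appName("scheduler-demo")
  .master("yarn") // usually supplied with spark-submit --master instead
  .getOrCreate()
```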

A good comparison between YARN and Mesos can be found here.

EMR gives you the option to easily spin up a Hadoop/YARN cluster. There are even bootstrap actions that install Spark on such a cluster for you.
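
If you want to script that spin-up rather than use the console, the AWS SDK for Java (usable from Scala) exposes the same RunJobFlow call. A rough sketch, assuming SDK v1 and the default EMR IAM roles; the cluster name, release label, and instance types are placeholders (on EMR 4.x+ release labels, Spark is a built-in application, so no separate bootstrap action is needed):

```scala
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClientBuilder
import com.amazonaws.services.elasticmapreduce.model.{Application, JobFlowInstancesConfig, RunJobFlowRequest}

object LaunchEmrCluster {
  def main(args: Array[String]): Unit = {
    val emr = AmazonElasticMapReduceClientBuilder.defaultClient()

    val request = new RunJobFlowRequest()
      .withName("spark-prototyping")          // placeholder cluster name
      .withReleaseLabel("emr-5.36.0")         // pick a current release label
      .withApplications(new Application().withName("Spark"))
      .withInstances(new JobFlowInstancesConfig()
        .withMasterInstanceType("m5.xlarge")  // placeholder instance types
        .withSlaveInstanceType("m5.xlarge")
        .withInstanceCount(3)
        .withKeepJobFlowAliveWhenNoSteps(true))
      .withJobFlowRole("EMR_EC2_DefaultRole") // default EMR roles
      .withServiceRole("EMR_DefaultRole")

    println(s"Started cluster: ${emr.runJobFlow(request).getJobFlowId}")
  }
}
```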

Hope this helps answer your question!