amazon-web-services, amazon-ec2, apache-spark, apache-drill

How to optimise AWS instance types in an Apache Spark and Drill EMR cluster?


I am reading S3 buckets with Drill and writing the data back to S3 as Parquet so that I can read it with Spark DataFrames for further analysis. AWS EMR requires me to have at least two core machines.

Will using micro instances for the master and core nodes affect performance?

I don't make use of HDFS as such, so I am thinking of making them micro instances to save money.

All computation will be done in memory by r3.xlarge spot instances running as task nodes anyway. Finally, does Spark utilise multiple cores in each machine? Or is it better to launch a fleet of r3.xlarge task nodes on EMR 4.1 so they can be auto-resized?
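
For context, the read-back step I have in mind looks roughly like this; a minimal sketch assuming the SparkSession API (Spark 2.x+), with the bucket name and output path as placeholders:

```python
# Minimal sketch of reading the Parquet output Drill wrote to S3.
# "my-bucket" and "drill-output/" are placeholder names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-drill-output").getOrCreate()

# On EMR the s3:// scheme is served by EMRFS, so no extra credential
# setup is needed on the cluster itself.
df = spark.read.parquet("s3://my-bucket/drill-output/")
df.printSchema()
```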


Solution

  • I don't know how familiar you are with Spark, but there are a couple of things you need to know about core usage:

    • You can set the number of cores used by the driver process (spark.driver.cores), but only in cluster mode. It is 1 by default.
    • You can also set the number of cores used by each executor (spark.executor.cores), for YARN and standalone mode only. It is 1 in YARN mode, and all available cores on the worker in standalone mode. In standalone mode, setting this parameter allows an application to run multiple executors on the same worker, provided there are enough cores on that worker; otherwise, only one executor per application will run on each worker. Both settings are sketched below.
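
    As a minimal sketch of how both options can be set, assuming YARN on EMR (the application name and values are illustrative, and in cluster mode the driver setting has to be supplied at submit time, e.g. via spark-submit --conf):

    ```python
    from pyspark import SparkConf, SparkContext

    conf = (
        SparkConf()
        .setAppName("core-settings-example")
        # Cores for the driver process; only honoured in cluster mode.
        .set("spark.driver.cores", "1")
        # Cores per executor; defaults to 1 on YARN.
        .set("spark.executor.cores", "4")
    )
    sc = SparkContext(conf=conf)
    ```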

    Now to answer both of your questions :

    Will using micro instances for the master and core nodes affect performance?

    • Yes. The driver needs a minimum amount of resources to schedule jobs, collect data occasionally, and so on. Performance-wise, you will need to benchmark according to your use case to find what suits your usage better, which you can do using Ganglia on AWS, for example.

    Does Spark utilise multiple cores in each machine?

    • Yes, Spark uses multiple cores on each machine: it runs one task per core by default, so an executor with several cores processes several partitions in parallel.
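
    One quick way to check this yourself is to look at the default parallelism, which on YARN reflects the total executor cores granted to the application. A minimal sketch, assuming an existing SparkContext named sc:

    ```python
    # Total parallelism Spark will use by default; on YARN this tracks
    # the executor cores granted to the application.
    print(sc.defaultParallelism)

    # Spark runs one task per core (spark.task.cpus defaults to 1), so a
    # job with one partition per core keeps every core busy.
    rdd = sc.parallelize(range(1000), numSlices=sc.defaultParallelism)
    print(rdd.sum())
    ```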

    You can also read this question on which instance type is preferred for an AWS EMR cluster running Spark.

    Spark support on AWS EMR is fairly new, but it is generally close to any other Spark cluster setup.

    I advise you to read the Plan EMR Instances chapter of the AWS EMR Developer Guide along with the official Spark documentation.