Tags: apache-spark, hadoop, pyspark, amazon-emr, amazon-kinesis

AWS EMR Multiple Jobs Dependency Contention


Problem

I am attempting to run 2 pyspark steps in EMR, both reading from Kinesis using KinesisUtils. This requires the dependency spark-streaming-kinesis-asl_2.11.

I'm using Terraform to stand up the EMR cluster and invoke both steps with the argument:

--packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5
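
In effect, each step ends up running a spark-submit along these lines (a sketch; the bucket and script names are illustrative):

    # bucket and script names below are placeholders
    spark-submit \
      --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5 \
      s3://my-bucket/steps/kinesis_job_1.py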

There appears to be contention at startup: both steps download the jar from Maven at the same time, causing a checksum failure.

Things attempted

  1. I've tried to move the download of the jar to the bootstrap bash script using:

sudo spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5

This causes problems, as spark-shell is only available on the master node and the bootstrap script runs on all nodes.

  2. I've tried to limit the above to run only on the master node using:

grep -q '"isMaster":true' /mnt/var/lib/info/instance.json || { echo "Not running on master node, nothing further to do" && exit 0; }

That didn't seem to work.
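
For reference, combining attempts 1 and 2 gives a bootstrap script roughly like this sketch (it assumes /mnt/var/lib/info/instance.json is already populated when the bootstrap action runs):

    #!/bin/bash
    # Exit early on non-master nodes; instance.json describes the node's role.
    grep -q '"isMaster":true' /mnt/var/lib/info/instance.json || \
      { echo "Not running on master node, nothing further to do" && exit 0; }

    # Pre-resolve the Kinesis dependency into the local ivy cache on the master node.
    sudo spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5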

  3. I've attempted to add Spark configuration to do this in the EMR configuration.json:

    {
      "Classification": "spark-defaults",
      "Properties": {
        "spark.jars.packages": "org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5"
      }
    }

This also didn't work and seemed to stop any jars being copied to the master node directory

/home/hadoop/.ivy2/cache
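
One way to verify that observation is to inspect the cache on the master node after a step has tried to resolve the package, e.g.:

    # list whatever has been resolved into the ivy cache (path taken from above)
    ls /home/hadoop/.ivy2/cache/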

What does work manually is logging onto the master node and running

sudo spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5

Then submitting the jobs manually without the --packages option.
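
In other words, the manual workaround amounts to something like this (the script location is illustrative):

    # submitted without --packages, per the manual workaround above
    spark-submit s3://my-bucket/steps/kinesis_job_1.py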

Currently, all I need to do is manually restart the failed steps separately (by cloning them in the AWS console) and everything runs fine.

I just want to be able to start the cluster with all steps starting successfully. Any help would be greatly appreciated.


Solution

  1. Download the required jars and upload them to S3 (one time).
  2. When running your pyspark jobs as steps, pass --jars <s3 location of jar> in your spark-submit (see the sketch below).
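
A minimal sketch of those two steps, assuming the standard Maven Central repository layout and an illustrative bucket name:

    # One time: fetch the jar and upload it to S3 (any transitive dependencies
    # needed at runtime can be handled the same way).
    wget https://repo1.maven.org/maven2/org/apache/spark/spark-streaming-kinesis-asl_2.11/2.4.5/spark-streaming-kinesis-asl_2.11-2.4.5.jar
    aws s3 cp spark-streaming-kinesis-asl_2.11-2.4.5.jar s3://my-bucket/jars/

    # In each EMR step, reference the jar from S3 instead of using --packages:
    spark-submit \
      --jars s3://my-bucket/jars/spark-streaming-kinesis-asl_2.11-2.4.5.jar \
      s3://my-bucket/steps/kinesis_job_1.py

Because each step now copies the jar from S3 rather than resolving it from Maven at start-up, the two steps no longer contend for the same download and cache.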