I am attempting to run two PySpark steps on EMR, both reading from Kinesis using KinesisUtils. This requires the dependent library spark-streaming-kinesis-asl_2.11.
I'm using Terraform to stand up the EMR cluster, invoking both steps with the args:
--packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5
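For context, each step's command line amounts to roughly the following (the deploy mode, bucket, and script name here are placeholders, not my actual setup):

spark-submit --deploy-mode cluster \
  --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5 \
  s3://my-bucket/steps/kinesis_consumer.py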
There appears to be contention at startup, with both steps downloading the jar from Maven at the same time and causing a checksum failure.
To work around this, I tried pre-fetching the dependency in a bootstrap action:

sudo spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5

This causes problems, as spark-shell is only available on the master node while bootstrap actions run on all nodes.
I added a guard so the bootstrap only does its work on the master node:

grep -q '"isMaster":true' /mnt/var/lib/info/instance.json || { echo "Not running on master node, nothing further to do" && exit 0; }

That didn't seem to work.
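For completeness, a minimal sketch of the combined bootstrap script (the here-string just makes spark-shell exit once the package has been resolved; depending on how instance.json is formatted, the grep pattern may need a space after the colon):

#!/bin/bash
# Do nothing on core/task nodes; spark-shell only exists on the master.
grep -q '"isMaster":true' /mnt/var/lib/info/instance.json \
  || { echo "Not running on master node, nothing further to do" && exit 0; }

# Warm the local ivy cache so later steps resolve the package without hitting Maven.
sudo spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5 <<< ':quit'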
I've also attempted to set this via Spark configuration in the EMR configuration JSON:
{
  "Classification": "spark-defaults",
  "Properties": {
    "spark.jars.packages": "org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5"
  }
}
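For what it's worth, the same property can also be passed per step rather than baked into spark-defaults (spark.jars.packages is the configuration property behind --packages; the script path is a placeholder):

spark-submit \
  --conf spark.jars.packages=org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5 \
  s3://my-bucket/steps/kinesis_consumer.py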
This also didn't work, and it seemed to stop any jars from being copied into the master node directory at all:
/home/hadoop/.ivy2/cache
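A quick check for whether the resolve happened at all is to list the cache on the master node (ivy lays the cache out by organisation name):

ls /home/hadoop/.ivy2/cache/org.apache.spark/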
What does work is manually logging onto the master node and running

sudo spark-shell --packages org.apache.spark:spark-streaming-kinesis-asl_2.11:2.4.5

and then submitting the jobs without the --packages option.
Currently, all I need to do is restart the failed jobs separately by hand (cloning the steps in the AWS console) and everything runs fine.
I just want to be able to start the cluster with all steps succeeding on the first attempt; any help would be greatly appreciated.
Try passing

--jars <s3 location of jar>

in your spark-submit instead of --packages. With the jar staged in S3, each step copies it down directly rather than both steps racing to resolve the same artifact from Maven, which appears to be what triggers the checksum failure.
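Concretely, something like this (bucket names and paths are examples to adapt; the assembly jar is used here because it bundles the KCL dependencies):

# One-time: stage the jar in S3.
aws s3 cp spark-streaming-kinesis-asl-assembly_2.11-2.4.5.jar s3://my-bucket/jars/

# Each EMR step then pulls the jar from S3 instead of resolving it from Maven.
spark-submit --deploy-mode cluster \
  --jars s3://my-bucket/jars/spark-streaming-kinesis-asl-assembly_2.11-2.4.5.jar \
  s3://my-bucket/steps/kinesis_consumer.py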