I am trying to run some simple jobs on EMR (AMI 3.6) with Hadoop 2.4 and Spark 1.3.1. I have installed Spark manually without a bootstrap script. Currently I am trying to read and process data from S3 but it seems like I am missing an endless number of jars on my classpath.
I am running the commands in spark-shell, which I start with:
spark-shell --jars jar1.jar,jar2.jar...
Commands run on the shell:
val lines = sc.textFile("s3://folder/file.gz")
lines.collect()
The errors always look something like: "Class xyz not found". After I find the needed jar and add it to the classpath, I get the same error again, but with a different class name in the message.
Is there a set of jars that are needed for working with (compressed and uncompressed) S3 files?
I was able to figure out the jars needed for my classpath by following the logic in the AWS GitHub repo https://github.com/awslabs/emr-bootstrap-actions/tree/master/spark.
The install-spark and install-spark-script.py files contain logic for copying jars into a new 'classpath' directory, which is then referenced by the SPARK_CLASSPATH variable in spark-env.sh.
The jars I was personally missing were located in /usr/share/aws/emr/emrfs/lib/ and /usr/share/aws/emr/lib/
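
If you would rather keep passing jars with --jars, as in the question, than copy them into a SPARK_CLASSPATH directory, something like the following might save some typing. It is only a rough sketch: collect_jars is a hypothetical helper name of my own, not part of the bootstrap scripts, and it assumes you want every jar in the two directories above, which may be more than is strictly needed.

```shell
# Hypothetical helper (not from the AWS bootstrap scripts): print every .jar
# found directly under the given directories as one comma-separated string,
# which is the format that spark-shell's --jars option takes.
collect_jars() {
  jars=""
  for dir in "$@"; do
    for jar in "$dir"/*.jar; do
      # Skip the literal, unexpanded glob when a directory has no jars.
      [ -e "$jar" ] || continue
      # Prepend a comma only when the list already has an entry.
      jars="${jars:+$jars,}$jar"
    done
  done
  printf '%s\n' "$jars"
}

# Possible usage on the EMR master node (directories taken from this answer):
# spark-shell --jars "$(collect_jars /usr/share/aws/emr/emrfs/lib /usr/share/aws/emr/lib)"
```

The function only builds the list; the spark-shell call is left commented out so you can inspect the output first. Paths containing commas would break the list, since --jars splits on commas.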