We are experimenting with loading data from Amazon S3 into a Spark 2.3 cluster configured under Mesosphere DC/OS. When we run the code in the PySpark shell, Spark does not recognize the S3 filesystem:
File "/root/spark/spark-2.3.0-bin-hadoop2.7/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py", line 320, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: java.io.IOException: No FileSystem for scheme: s3
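The read that triggers this is roughly of the following form (the bucket and key are placeholders):

# 'sc' is the SparkContext that the PySpark shell creates automatically.
rdd = sc.textFile("s3://our-bucket/path/to/data.txt")
rdd.collect()  # raises the Py4JJavaError shown above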
Which libraries or JARs do we need to add to Spark so that it recognizes S3?
Use the 's3a://' scheme instead of 's3://'. The Spark distribution built against Hadoop 2.7 has no filesystem implementation bound to the bare 's3' scheme; the S3A connector lives in the separate hadoop-aws module, so that module (and its AWS SDK dependency) must be on the classpath, for example via --packages with a hadoop-aws version matching your Hadoop build, such as org.apache.hadoop:hadoop-aws:2.7.3.
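A minimal sketch, assuming the cluster can pull the hadoop-aws package at launch; the bucket, key, and credential values below are placeholders (in practice, prefer environment variables or an instance profile over hard-coded keys):

# Launch with the S3A connector on the classpath, e.g.:
#   pyspark --packages org.apache.hadoop:hadoop-aws:2.7.3
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-read-example")
    # Placeholder credentials; drop these lines if credentials come from the environment.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Read with the s3a:// scheme instead of s3://.
rdd = spark.sparkContext.textFile("s3a://your-bucket/path/to/data.txt")
print(rdd.take(5))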