python, apache-spark, amazon-s3, spark-submit

How do I load an s3a file as a DataFrame in Spark? Which command should I run?


I have a valid JSON file, and I can successfully load it on my local Spark machine:

DF = sqlContext.read.json("/home/me/myfile.json")

I have a shell script to submit the job:

/home/me/spark/bin/spark-submit \
--master local[*] Code.py 

So far so good; for example, DF.show(1) works fine.

Now I am trying to load from an s3a link (which contains exactly the same data as myfile.json).

I have tried:

DF = sqlContext.read.json("s3a://some-bucket/myfile.json")

I still run my shell script, which contains the same command, i.e.

/home/me/spark/bin/spark-submit \
--master local[*] Code.py 

But this time it does not work; I get the following error:

java.lang.ClassNotFoundException: Class org.apache.hadoop.fs.s3a.S3AFileSystem not found

Is my shell script wrong?

PS: I just got the s3a link from someone else, so the bucket is not on my AWS account. I assume I can still import the data from that link even though I do not have an access key or secret key...


Solution

  • I finally resolved the issue by adding the right .jar file (see my comment below) and setting AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY inside spark-env.sh, which is located in the conf folder of my Spark installation; a sketch of that setup is shown below.

    Thanks
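
For reference, a minimal sketch of what that setup can look like. The specific .jar is not named above, so the hadoop-aws coordinates and the 2.7.3 version below are assumptions that must match the Hadoop build your Spark was compiled against, and the key values are placeholders.

# conf/spark-env.sh -- credentials that Spark forwards to the s3a connector
export AWS_ACCESS_KEY_ID=YOUR_ACCESS_KEY_ID
export AWS_SECRET_ACCESS_KEY=YOUR_SECRET_ACCESS_KEY

# submit script -- pull in the hadoop-aws module that provides
# org.apache.hadoop.fs.s3a.S3AFileSystem (the version here is an assumption)
/home/me/spark/bin/spark-submit \
--master local[*] \
--packages org.apache.hadoop:hadoop-aws:2.7.3 \
Code.py

If you already have the jar(s) on disk, passing them with --jars instead of --packages achieves the same thing; --packages just resolves them from Maven Central along with their dependencies.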