Amazon EMR Spark Cluster: output/result not visible

I am running a Spark cluster on Amazon EMR. I am running the PageRank example programs on the cluster.

While running the programs on my local machine, I am able to see the output properly. But the same doesn't work on EMR. The S3 folder only shows empty files.

The commands I am using: For starting the cluster:

aws emr create-cluster --name SparkCluster --ami-version 3.2 --instance-type m3.xlarge --instance-count 2 \
  --ec2-attributes KeyName=sparkproj --applications Name=Hive \
  --bootstrap-actions Path=s3://support.elasticmapreduce/spark/install-spark  \
  --log-uri s3://sampleapp-amahajan/output/ \
  --steps Name=SparkHistoryServer,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=s3://support.elasticmapreduce/spark/start-history-server

For adding the job:

aws emr add-steps --cluster-id j-9AWEFYP835GI --steps \
Name=PageRank,Jar=s3://elasticmapreduce/libs/script-runner/script-runner.jar,Args=[/home/hadoop/spark/bin/spark-submit,--deploy-mode,cluster,--master,yarn-cluster,--class,SparkPageRank,s3://sampleapp-amahajan/pagerank_2.10-1.0.jar,s3://sampleapp-amahajan/web-Google.txt,2],ActionOnFailure=CONTINUE

After a few unsuccessful attempts... I made a text file for the output of the job and it is successfully created on my local machine. But I am unable to view the same when I SSH into the cluster. I tried FoxyProxy to view the logs for the instances and neither does anything show up there.

Could you please let me know where I am going wrong?

Thanks!

Solution

How are you writing the text file locally? Generally, EMR jobs save their output to S3, so you could use something like outputRDD.saveToTextFile("s3n://<MY_BUCKET>"). You could also save the output to HDFS, but storing the results to S3 is useful for "ephemeral" clusters-- where you provision an EMR cluster, submit a job, and terminate upon completion.