Tags: amazon-web-services, amazon-s3, pyspark, emr

Spark application hangs when reading a file from S3


I have an application that runs on EMR and reads a CSV file from S3. However, the whole job appears to hang (I've let it run for about an hour) as soon as it tries to read that file. Nothing happens and nothing more is written to the logs, other than that the application is still running. The step in which the application runs does not fail!

I've tried copying the file to the cluster via spark-submit's --files flag, and I've also tried reading it directly within the application with sc.textFile(filename); both approaches are sketched below.
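
For reference, the two approaches look roughly like this (the bucket, path, and file names are placeholders, not the actual ones from my setup):

```python
from pyspark import SparkContext, SparkFiles

sc = SparkContext(appName="read-csv-example")

# Approach 1: read the file straight from S3.
# On EMR the s3:// scheme is handled by EMRFS.
rdd_s3 = sc.textFile("s3://my-bucket/path/to/data.csv")

# Approach 2: ship the file with `spark-submit --files data.csv`
# and read the local copy that Spark distributes to the nodes.
rdd_local = sc.textFile("file://" + SparkFiles.get("data.csv"))

# Either way, this action is where the job hangs.
print(rdd_s3.take(5))
```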

Is there anything I am missing?


Solution

  • After a while I finally got back to this problem and was able to "solve" it myself (though I still don't know what the actual problem was). It seems Spark was failing to allocate worker nodes, so the job sat waiting indefinitely rather than failing. After setting spark.dynamicAllocation.enabled to true, everything works as expected.
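
If it helps anyone, dynamic allocation can be enabled either at submit time with `--conf spark.dynamicAllocation.enabled=true`, or in the application itself. A minimal sketch of the in-application version follows; the app name is a placeholder, and the properties are standard Spark settings:

```python
from pyspark import SparkConf, SparkContext

conf = (
    SparkConf()
    .setAppName("my-emr-app")  # placeholder name
    # Let Spark scale the number of executors up and down instead of
    # waiting on a fixed allocation that may never be satisfied.
    .set("spark.dynamicAllocation.enabled", "true")
    # Dynamic allocation on YARN also requires the external shuffle
    # service, so executors can be removed without losing shuffle data.
    .set("spark.shuffle.service.enabled", "true")
)

sc = SparkContext(conf=conf)
```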