apache-spark amazon-emr apache-spark-2.0

Huge delays translating the DAG to tasks


These are my steps:

  1. Submit the Spark app to an EMR cluster.
  2. The driver starts and I can see the Spark UI (no stages have been created yet).
  3. The driver reads an ORC file with ~3000 parts from S3, applies some transformations, and saves the result back to S3.
  4. Executing the save should create some stages in the Spark UI, but the stages take a really long time to appear.
  5. The stages appear and execution starts.

Why am I getting that huge delay in step 4? During this time the cluster is apparently waiting for something and CPU usage is 0%.

Thanks


Solution

  • Despite its merits, S3 is not a file system, which makes it a suboptimal choice for working with complex binary formats such as ORC, which are typically designed with an actual file system in mind. In many cases secondary tasks (like reading metadata) are more expensive than the actual data fetching.
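
To get a feel for how metadata work alone could account for a long idle period, here is a rough back-of-envelope sketch in Python. Only the ~3000 parts figure comes from the question. The number of requests per file, the per-request latency, and the parallelism are assumptions picked for illustration, and the function name is hypothetical. It is a toy model, not a measurement of S3 or of what Spark actually does.

```python
def estimate_metadata_delay_seconds(num_files, requests_per_file,
                                    latency_ms, parallelism=1):
    """Rough wall-clock estimate for metadata requests issued before any task runs.

    Assumes every request costs the same latency and that requests split
    evenly across `parallelism` concurrent workers.
    """
    total_requests = num_files * requests_per_file
    total_ms = total_requests * latency_ms / parallelism
    return total_ms / 1000.0


# ~3000 parts is from the question; everything else is an assumed value.
NUM_PARTS = 3000
REQUESTS_PER_FILE = 3   # assumption: e.g. a listing/status call plus footer reads
LATENCY_MS = 100        # assumption: round-trip time of one S3 request

sequential = estimate_metadata_delay_seconds(
    NUM_PARTS, REQUESTS_PER_FILE, LATENCY_MS)
parallel = estimate_metadata_delay_seconds(
    NUM_PARTS, REQUESTS_PER_FILE, LATENCY_MS, parallelism=16)

print("sequential: %.1f s" % sequential)
print("16-way parallel: %.1f s" % parallel)
```

Under these assumed numbers, a single thread would spend about 15 minutes on round trips while the executors sit idle, which is consistent with the symptom of 0% CPU and no stages. The real figures depend on the Spark version, the S3 connector, and the file layout, so treat this only as an order-of-magnitude argument.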