Tags: pyspark, aws-glue

AWS Glue Spark application logs remain .inprogress


I am using AWS Glue to run a PySpark job. The job completes successfully (the last step writes a dataframe to S3, and I can see that the files are created).

The problem is that I do not get "finished" Spark event logs that I can view in the Spark UI.

In the Glue job details I have specified an S3 location for the Spark UI logs path, and files do get written there with a name like spark-application-<nnnnnnnnn>.inprogress, but when the job completes no final log file (without the .inprogress extension) is written.
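For reference, this is roughly how the job is configured. The sketch below uses boto3 rather than the console; the job name, role ARN, and S3 paths are placeholders, and --enable-spark-ui / --spark-event-logs-path are the Glue job parameters that correspond to the "Spark UI logs path" setting in the job details:

import boto3

glue = boto3.client("glue")

# Sketch only: name, role, and S3 paths below are placeholders.
glue.create_job(
    Name="my-pyspark-job",
    Role="arn:aws:iam::123456789012:role/MyGlueJobRole",
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/my_job.py",
        "PythonVersion": "3",
    },
    DefaultArguments={
        # Turn on the Spark UI and point the Spark event logs at S3.
        "--enable-spark-ui": "true",
        "--spark-event-logs-path": "s3://my-bucket/spark-ui-logs/",
    },
)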

Based on the accepted answer to this SO question I tried artificially extending the job time with a time.sleep(2*60) at the end, but it didn't make a difference.

I poked around a bit in the CloudWatch logs and did find these log messages towards the end:

...
2023-10-11 16:25:20,274 INFO [Executor task launch worker for task 3.0 in stage 1.0 (TID 4)] mapred.SparkHadoopMapRedUtil (Logging.scala:logInfo(61)): attempt_202310111625128823931779720627407_0001_m_000003_4: Committed. Elapsed time: 75 ms.
2023-10-11 16:25:20,276 INFO [Executor task launch worker for task 3.0 in stage 1.0 (TID 4)] executor.Executor (Logging.scala:logInfo(61)): Finished task 3.0 in stage 1.0 (TID 4). 2062 bytes result sent to driver
2023-10-11 16:25:20,885 ERROR [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logError(77)): Executor self-exiting due to : Driver 172.16.47.104:34325 disassociated! Shutting down.
2023-10-11 16:25:20,887 INFO [dispatcher-Executor] executor.CoarseGrainedExecutorBackend (Logging.scala:logInfo(61)): Driver from 172.16.47.104:34325 disconnected during shutdown
2023-10-11 16:25:20,889 INFO [CoarseGrainedExecutorBackend-stop-executor] sink.GlueCloudwatchSink (GlueCloudwatchSink.scala:logInfo(22)): CloudwatchSink: SparkContext stopped - not reporting metrics now.
...

Could the error message about the driver being disassociated and disconnected during shutdown be the reason the logs are never finalized? What could be causing that error?

EDIT--------------

I tried adding a delay at the end of my job as suggested in this SO answer but that didn't work. Neither did adding os._exit(0), which was recommended in this answer on how to get a job to end gracefully.


Solution

  • Here's what eventually worked for me, YMMV.

    I added the following to my job after job.commit():

    spark.stop()
    time.sleep(60)
    

    Neither one by itself seemed to work, but the combination of the two results in another file being written without the .inprogress extension, and my Spark UI is able to show me the details.
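
    For context, here is a minimal sketch of how the end of the script looks with this change. The setup is the standard Glue job boilerplate and the ETL body is elided; the only additions relative to a stock job are the spark.stop() and time.sleep(60) after job.commit(), which (as far as I can tell) let the driver close and flush the event log to S3 before Glue tears the container down:

    import sys
    import time

    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(sys.argv, ["JOB_NAME"])

    sc = SparkContext()
    glue_context = GlueContext(sc)
    spark = glue_context.spark_session
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # ... ETL work here, ending with the dataframe write to S3 ...

    job.commit()

    # Stop the SparkContext explicitly so the event log is closed,
    # then give the driver time to rename the .inprogress file in S3
    # before the Glue container is shut down.
    spark.stop()
    time.sleep(60)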