I am running a spark job on a 10 TB dataset using EMR. I am using the Spark history server to monitor its progress. However, when the logs get really large, the spark history server and the EMR UI both stop updating. Is my EMR job still running or has it stopped working too?
Furthermore, when the Spark history server stops updating, all my EC2 instances drop from over 75% CPU utilization to 0% (they subsequently climb back up), and the EMR console shows 0 containers reserved and all memory freed (these also return to normal afterwards).
Has something happened to my EMR job? Is there a way I can keep the Spark history server working when the logs get really large?
Thanks.
Yes, this can happen when the event-log history grows very large. You can enable the history server's automatic log cleanup by setting the following properties in the spark-defaults.conf file and then restarting the history server:
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.maxAge 12h
spark.history.fs.cleaner.interval 1h
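On EMR, rather than editing spark-defaults.conf by hand on the master node, you can also set these properties at cluster creation time through EMR's configuration classifications. A sketch of the JSON you would pass (e.g. via the console's "Edit software settings" or the CLI's --configurations option), assuming the same cleaner values as above:

```json
[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.history.fs.cleaner.enabled": "true",
      "spark.history.fs.cleaner.maxAge": "12h",
      "spark.history.fs.cleaner.interval": "1h"
    }
  }
]
```

With this approach EMR writes the properties into spark-defaults.conf for you when the cluster is provisioned. maxAge controls how old a job's history files can be before the cleaner deletes them, and interval controls how often the cleaner runs, so tune both to how long you actually need old application logs around.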