google-cloud-dataproc, google-cloud-dataproc-serverless

How to properly kill a running Dataproc batch job?


I ran a long-running batch job on Dataproc Serverless. After it had been running for a while, I realized that letting it run any longer was a waste of time and money, and I wanted to stop it.

I couldn't find a way to kill the job directly. However, there were two other options:

  1. Cancel the batch
  2. Delete the batch

Initially, I used the first option, and I cancelled the job using:

gcloud dataproc batches cancel BATCH --region=REGION

On the Dataproc batches console, the job showed as cancelled, and I could also see the DCU and shuffle storage usage.

But the surprising part is that, even a day later, the Spark History Server still shows the job as running.

After this, I decided to go with the second option and delete the batch job, so I ran this command:

gcloud dataproc batches delete BATCH --region=REGION

This removed the batch entry from the Dataproc batches console, but the Spark History Server still shows the job as running.
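
For reference, the remaining batches in a region can be listed with the command below (same REGION placeholder as above); after the delete, the batch entry should no longer appear in this list:

gcloud dataproc batches list --region=REGION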

My questions are:

  1. What is the best way to kill the job?
  2. Am I still being charged after cancelling the running job?

Solution

  • What you are observing is a known shortcoming of Spark and the Spark History Server. Spark marks only successfully finished Spark applications as completed and leaves failed/cancelled applications in the in-progress/incomplete state (https://spark.apache.org/docs/latest/monitoring.html#spark-history-server-configuration-options):

    1. Applications which exited without registering themselves as completed will be listed as incomplete —even though they are no longer running. This can happen if an application crashes.

    To monitor the batch job state, you need to use the Dataproc API: if the Dataproc API/UI shows that the state of the batch job is CANCELLED, it is no longer running, regardless of the Spark application status in the Spark History Server.
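
    For example, one minimal way to check this from the CLI (reusing the BATCH and REGION placeholders from the question; the --format projection simply prints the batch's state field):

    gcloud dataproc batches describe BATCH --region=REGION --format="value(state)"

    If this prints CANCELLED (or SUCCEEDED / FAILED), the batch is no longer running, whatever the Spark History Server shows.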