Tags: scala, apache-spark, google-cloud-platform, log4j, google-cloud-dataproc

gcloud CLI application logs to bucket


There are Scala Spark application jobs that run daily in GCP. I am trying to set up a notification to be sent when a run is completed. One way I thought of doing that was to get the logs and grep for a specific completion message (not sure if there's a better way). But I found that the logs are only shown in the console, on the job details page, and are not saved to a file.

Is there a way to route these logs to a file in a bucket so that I can search them? Do I have to specify where to write these logs in the log4j properties file, for example by giving a bucket location to log4j.appender.stdout = org.apache.log4j.ConsoleAppender?

I tried submitting a job with the command below, but it gives me this error: grep:**-2022-07-08.log: No such file or directory

...

gcloud dataproc jobs submit spark \
    --project $PROJECT --cluster=$CLUSTER --region=$REGION --class=***.spark.offer.Main \
    --jars=gs://**.jar\
    --properties=driver-memory=10G,spark.ui.filters="",spark.memory.fraction=0.6,spark.sql.files.maxPartitionBytes=5368709120,spark.memory.storageFraction=0.1,spark.driver.extraJavaOptions="-Dcq.config.name=gcp.conf",spark.executor.extraJavaOptions="-Dlog4j.configuration=log4j-executor.properties -Dcq.config.name=gcp.conf" \
    --gcp.conf > gs://***-$date.log  2>&1


Solution

  • By default, Dataproc job driver logs are saved in GCS at the Dataproc-generated driverOutputResourceUri of the job. See the Dataproc documentation on job driver output for more details.

    But IMHO, a better way to determine if a job has finished is through gcloud dataproc jobs describe <job-id>, or the jobs.get REST API.
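
As a rough sketch of that approach, the shell functions below poll gcloud dataproc jobs describe until the job reaches a terminal state, and look up where the driver log is stored. The function names wait_for_job and driver_output_uri, the POLL_SECONDS variable and its 30-second default are illustrative assumptions, not something from the original post. The terminal states used here (DONE, ERROR, CANCELLED) are my reading of the Dataproc job status values and are worth checking against the current API reference.

```shell
# Poll a Dataproc job until it reaches a terminal state.
# Usage: wait_for_job <job-id> <region>
# Returns 0 if the job ended in DONE, 1 if it ended in ERROR or CANCELLED,
# and 2 if the gcloud call itself failed.
wait_for_job() {
  local job_id="$1"
  local region="$2"
  local state
  while true; do
    # Ask gcloud for just the status.state field of the job resource.
    state="$(gcloud dataproc jobs describe "$job_id" \
      --region="$region" --format='value(status.state)')" || return 2
    case "$state" in
      DONE)
        echo "Job $job_id finished with state $state"
        return 0
        ;;
      ERROR|CANCELLED)
        echo "Job $job_id finished with state $state" >&2
        return 1
        ;;
    esac
    # Any other state (e.g. PENDING, RUNNING): wait and ask again.
    sleep "${POLL_SECONDS:-30}"
  done
}

# Print the GCS location of the job's driver output.
# Usage: driver_output_uri <job-id> <region>
driver_output_uri() {
  gcloud dataproc jobs describe "$1" \
    --region="$2" --format='value(driverOutputResourceUri)'
}
```

A notification step could then hang off the exit status, e.g. wait_for_job "$JOB_ID" "$REGION" && send_notification, where send_notification is a hypothetical command standing in for whatever alerting you use. If you still want to grep the driver log, something like gsutil cat "$(driver_output_uri "$JOB_ID" "$REGION")*" | grep "your completion message" may work, assuming the driver output is stored as one or more objects under that URI prefix.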