python · google-cloud-platform · google-cloud-dataproc · python-logging · google-cloud-storage

Dataproc: PySpark logging to GCS Bucket


I have a PySpark job running on Dataproc. Currently, we are logging to the console/YARN logs. Per our requirements, we need to store the logs in a GCS bucket. Is there a way to log directly to files in a GCS bucket with the Python logging module?

I have tried configuring the logging module with the config below, but it throws an error (FileNotFoundError: [Errno 2] No such file or directory: '/gs:/bucket_name/newfile.log').

logging.basicConfig(filename="gs://bucket_name/newfile.log", format='%(asctime)s %(message)s', filemode='w')


Solution

  • By default, yarn:yarn.log-aggregation-enable is set to true and yarn:yarn.nodemanager.remote-app-log-dir is set to gs://<cluster-tmp-bucket>/<cluster-uuid>/yarn-logs on Dataproc 1.5+, so YARN container logs are already aggregated into that GCS directory (see the gsutil example below for browsing them). You can change the aggregation directory with

    gcloud dataproc clusters create ... \
      --properties yarn:yarn.nodemanager.remote-app-log-dir=<gcs-dir>
    

    or change the cluster's temp bucket with

    gcloud dataproc clusters create ... --temp-bucket <bucket>
    

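    For example, once a job has finished you can locate the cluster's temp bucket and browse the aggregated container logs with gcloud and gsutil. This is only an illustrative sketch: my-cluster and us-central1 are placeholders, and the field names in the --format expression are assumed from the Dataproc REST resource:

      # Find the cluster's temp bucket and UUID (assumed field names: config.tempBucket, clusterUuid)
      gcloud dataproc clusters describe my-cluster --region=us-central1 \
          --format='value(config.tempBucket,clusterUuid)'

      # Browse the aggregated YARN container logs under the default layout
      gsutil ls -r gs://<cluster-tmp-bucket>/<cluster-uuid>/yarn-logs/
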
    Note that

    • If your Spark job runs in client mode (the default), the Spark driver runs on the master node instead of in YARN, and the driver logs are written to the location referenced by the Dataproc-generated job property driverOutputResourceUri, which is a job-specific folder in the cluster's staging bucket (an example of retrieving it follows after these notes). In cluster mode, the Spark driver runs in YARN, so the driver logs are YARN container logs and are aggregated as described above.

    • If you want to disable Cloud Logging for your cluster, set dataproc:dataproc.logging.stackdriver.enable=false. Note that this disables all types of Cloud Logging logs, including YARN container logs, startup logs, and service logs.
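
    For a client-mode job, one way to find and read the driver logs is to query driverOutputResourceUri for the job and then read the files under that prefix with gsutil. This is an illustrative sketch; the job ID and region are placeholders:

      # Print the GCS prefix where Dataproc stores this job's driver output
      gcloud dataproc jobs describe my-job-id --region=us-central1 \
          --format='value(driverOutputResourceUri)'

      # Read the driver output files under that prefix (substitute the URI printed above)
      gsutil cat '<driver-output-resource-uri>*'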