apache-spark, log4j, google-kubernetes-engine, stackdriver, google-cloud-logging

Structured Logging with Apache Spark on Google Kubernetes Engine


I am running Apache Spark applications on a Google Kubernetes Engine (GKE) cluster, which propagates all output from STDOUT and STDERR to Cloud Logging. However, granular log severity levels are not propagated: every message gets either INFO or ERROR severity in Cloud Logging (depending on whether it was written to stdout or stderr), and the actual severity level is hidden in a text property.

My goal is to emit the messages in the structured logging JSON format so that the severity level is propagated to Cloud Logging. Unfortunately, Apache Spark still uses the deprecated log4j 1.x library for logging, so I would like to know how to format log messages in a way that Cloud Logging picks up correctly.

So far, I am using the following default log4j.properties file:

log4j.rootCategory=INFO, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

Solution

  • When Cloud Logging is enabled in a GKE cluster, logging is managed by GKE, so changing the format of the logs is not as easy as it is on a GCE instance.

    To push JSON format logs in GKE, you can try the following options:

    1. Make your software write its logs in JSON format. Cloud Logging detects JSON-formatted log entries and ingests them as structured logs, and a severity field in the JSON object then sets the severity of the entry.

    2. Manage your own fluentd deployment, as suggested here, and set up your own parser. The solution is then managed by you and no longer by GKE.

    3. Add a sidecar container that reads your logs, converts them to JSON, and writes the JSON to stdout. The logging agent in GKE will ingest the sidecar's logs as JSON.

    Bear in mind that option 3 comes with some caveats: it can lead to significant resource consumption, and you won't be able to use kubectl logs, as explained here.
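
To make the conversion in options 1 and 3 concrete, here is a minimal Python sketch of what a sidecar-style converter might do. It assumes the log lines follow the ConversionPattern from the question and that the sidecar receives them on stdin (for example, piped from a shared log file). The function name to_structured and the fields logger and log4j_time are hypothetical and chosen for illustration; severity and message are the fields the Cloud Logging agent interprets in a JSON line. The level mapping is one reasonable choice, not an official one.

```python
import json
import re
import sys

# Matches lines written with the ConversionPattern from the question:
#   %d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
LINE_RE = re.compile(
    r"^(?P<time>\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<logger>[^\s:]+): "
    r"(?P<message>.*)$"
)

# Assumed mapping from log4j 1.x levels to Cloud Logging severities.
SEVERITY = {
    "TRACE": "DEBUG",
    "DEBUG": "DEBUG",
    "INFO": "INFO",
    "WARN": "WARNING",
    "ERROR": "ERROR",
    "FATAL": "CRITICAL",
}


def to_structured(line: str) -> str:
    """Convert one log4j text line into a single-line JSON log entry."""
    line = line.rstrip("\n")
    match = LINE_RE.match(line)
    if match is None:
        # Lines that do not fit the pattern (e.g. stack trace continuations)
        # are passed through unchanged with the DEFAULT severity.
        return json.dumps({"severity": "DEFAULT", "message": line})
    return json.dumps(
        {
            "severity": SEVERITY.get(match["level"], "DEFAULT"),
            "message": match["message"],
            # Hypothetical extra fields, kept as-is from the original line.
            "logger": match["logger"],
            "log4j_time": match["time"],
        }
    )


if __name__ == "__main__":
    # Read text log lines from stdin and write one JSON object per line.
    for raw_line in sys.stdin:
        print(to_structured(raw_line), flush=True)
```

For option 1, the same JSON shape would have to be produced by a JSON-emitting log4j layout inside the Spark driver and executors, so that no separate conversion step is needed.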