Tags: python, java, logging, pyspark, log4j

Configuring log4j2 with IDE / pyspark shell to log to console and file using a properties file


In my development env on my laptop (an Ubuntu box), I write my "pyspark code" and test it with pytest. I inject the spark session created by SparkSession.builder.getOrCreate() into my tests. This "pyspark code" has a bunch of log statements at info level, but none of them show up anywhere, so I'm forced to use print(), which I later remove.

All I want is:

  • Ability to set log level for all my loggers (My.*)
  • Write to console and file (appenders)
  • Some docs that describe the format specs: %t, %m, %n, %d{yyyy-MM-dd}, ...

I read:

  • Configuring Logging in the official Spark documentation. It just says to add log4j2.properties. I could not find any docs on the format of log4j2.properties, and the sample log4j2.properties.template is VERY limited. NOTE that it says nothing about using XML or other formats, just properties.
  • All the official log4j2 docs point to the same place, which has no docs on the contents of log4j2.properties. It gives some samples, but those are useless, e.g. where do I specify the filename for a file appender (not a rolling file appender)?
  • I even tried the log4j (not log4j2) docs; that clearly doesn't work (ConfigurationException: No type attribute provided for Appender my_console).

I don't want to figure it out by trial and error. Just looking for docs.

my_spark_module.py

class MySparkModule:
    def __init__(self, spark):
        # grab a JVM-side log4j logger through the py4j gateway
        self.log = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.SparkModule')
        self.df = spark.createDataFrame([('data1', '2022-01-01T00:00:00'), ('data2', '2022-02-01T00:00:00')],
            schema=['data', 'ts_column'])
        self.log.info('exiting init()')

    def get_count(self):
        return self.df.count()

test_my_spark_module.py

import pytest
from pyspark.sql import SparkSession

from my_spark_module import MySparkModule

@pytest.fixture(autouse=True, scope='module')
def spark_session():
    return SparkSession.builder.getOrCreate()


class TestMySparkModule:
    def test_tables(self, spark_session):
        log = spark_session.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.TestMySparkModule')
        log.info('something I wanted to log')
        
        assert MySparkModule(spark_session).get_count() == 2, 'expected count to be 2'

Solution

    To set up log4j2 logging from your spark session to /tmp/my-spark.log and stdout, place the file below at $SPARK_HOME/conf/log4j2.properties.

    We use a poetry virtual env, so SPARK_HOME was <project-root>/.venv/lib/python3.10/site-packages/pyspark/, which had no conf folder. So I created the conf folder and added <project-root>/.venv/lib/python3.10/site-packages/pyspark/conf/log4j2.properties.

    Any spark session started using <project-root>/.venv/bin/pyspark will use this log4j2 config.
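
    If you want to find that location programmatically, here is a minimal sketch, assuming a pip/poetry install where the pyspark package directory acts as SPARK_HOME (as it did for us); the printed path is where log4j2.properties should go:

    import os
    import pyspark

    # For pip/poetry installs, the pyspark package directory doubles as SPARK_HOME.
    spark_home = os.path.dirname(pyspark.__file__)
    conf_dir = os.path.join(spark_home, "conf")
    os.makedirs(conf_dir, exist_ok=True)  # conf/ does not exist by default

    # Place the log4j2.properties file shown below at this path.
    print(os.path.join(conf_dir, "log4j2.properties"))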

    Change logger.spark_loggers.level = trace|debug|info to change the log level of Spark's own loggers and get VERY detailed logs.
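
    If you only need a quick, temporary change while a session is running, something like this should also work, without touching the file (SparkContext.setLogLevel overrides the configured level for that session):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Overrides the configured root log level for this session only.
    # Valid values: ALL, TRACE, DEBUG, INFO, WARN, ERROR, FATAL, OFF
    spark.sparkContext.setLogLevel('DEBUG')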

    NOTE the comments inside the file.

    # Reload config every X seconds, for testing out changes to
    # this file without quitting your spark console session
    # monitorInterval = 3  # seconds
    
    property.log_pattern = %d{MM-dd HH:mm:ss.SSS} %5p [%t-%X{tid}] %c{*.1} - %m%n%exception{full}
    
    appender.my_console.name = my_console
    appender.my_console.type = Console
    appender.my_console.target = SYSTEM_OUT
    appender.my_console.layout.type = PatternLayout
    appender.my_console.layout.pattern = ${log_pattern}
    
    appender.my_file.name = my_file
    appender.my_file.type = File
    appender.my_file.filename = /tmp/my-spark.log
    appender.my_file.layout.type = PatternLayout
    appender.my_file.layout.pattern = ${log_pattern}
    # Uncomment if you want to overwrite the log file every time your program runs
    # appender.my_file.Append=false
    
    # For deploying Spark ThriftServer
    # SPARK-34128: Suppress undesirable TTransportException warnings involved in THRIFT-4805
    appender.my_console.filter.1.type = RegexFilter
    appender.my_console.filter.1.regex = .*Thrift error occurred during processing of message.*
    appender.my_console.filter.1.onMatch = deny
    appender.my_console.filter.1.onMismatch = neutral
    
    
    # Send all logs to the file
    rootLogger.level = warn
    # Uncomment if you also want to log everything to the console; it will be VERY noisy
    # rootLogger.appenderRef.my_console.ref = my_console
    rootLogger.appenderRef.my_file.ref = my_file
    
    # update to change Spark's log level (all loggers with the org.apache.spark prefix);
    # with additivity off, they need their own appender ref to reach the file
    logger.spark_loggers.name = org.apache.spark
    logger.spark_loggers.level = warn
    logger.spark_loggers.appenderRef.my_file.ref = my_file
    logger.spark_loggers.additivity = false
    
    # Control loggers with the com.kash.* prefix (change com.kash to your own logger name prefix, e.g. My)
    logger.my_loggers.name = com.kash
    logger.my_loggers.level = trace
    # logger.my_loggers.appenderRef.my_console.ref = my_console
    logger.my_loggers.appenderRef.my_file.ref = my_file
    logger.my_loggers.additivity = false
    
    # Set the default spark-shell/spark-sql log level to WARN. When running the
    # spark-shell/spark-sql, the log level for these classes is used to overwrite
    # the root logger's log level, so that the user can have different defaults
    # for the shell and regular Spark apps.
    logger.repl.name = org.apache.spark.repl.Main
    logger.repl.level = warn
    
    logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
    logger.thriftserver.level = warn
    
    # Settings to quiet third party logs that are too verbose
    logger.jetty1.name = org.sparkproject.jetty
    logger.jetty1.level = warn
    logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
    logger.jetty2.level = error
    logger.replexprTyper.name = org.apache.spark.repl.SparkIMain$exprTyper
    logger.replexprTyper.level = info
    logger.replSparkILoopInterpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
    logger.replSparkILoopInterpreter.level = info
    logger.parquet1.name = org.apache.parquet
    logger.parquet1.level = error
    logger.parquet2.name = parquet
    logger.parquet2.level = error
    
    # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
    logger.RetryingHMSHandler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
    logger.RetryingHMSHandler.level = fatal
    logger.FunctionRegistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
    logger.FunctionRegistry.level = error
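
    To sanity-check the config, a sketch along these lines should work, assuming Spark 3.3+ where log4j2 is active and org.apache.log4j.LogManager is still reachable through the bridge (as in the question's code). The logger name com.kash.Check is just a placeholder; it only needs to fall under the prefix you set in logger.my_loggers.name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # The logger name must start with the configured prefix (com.kash here).
    log = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('com.kash.Check')
    log.info('hello from pyspark, this should end up in /tmp/my-spark.log')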