In my development environment on my laptop (an Ubuntu box), I write my "pyspark code" and run it through pytest. I inject the spark session created with SparkSession.builder.getOrCreate() into my tests. This "pyspark code" has a bunch of log statements at info level, but none of them show up anywhere, so I'm forced to fall back to print() statements, which I later remove.
All I want is to be able to set the log level of my own loggers (My.*) and to control the log line format (%t, %m, %n, %d{YYYY-MM-DD}, ...).
I read:
Spark's documentation, which just says to add a log4j2.properties file in the conf directory; I could not find any docs on the format of log4j2.properties.
The sample log4j2.properties.template, which is VERY limited. NOTE that it says nothing about using xml or other formats, just properties.
The log4j2 documentation on properties configuration. It gives some samples, but those are useless, e.g. where do I specify the filename for a plain file appender (not a rolling file appender)? My attempts only produced errors like ConfigurationException: No type attribute provided for Appender my_console.
I don't want to figure it out by trial and error. I'm just looking for docs.
my_spark_module.py

class MySparkModule:
    def __init__(self, spark):
        # Get a JVM-side log4j logger through py4j
        self.log = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.SparkModule')
        self.df = spark.createDataFrame(
            [('data1', '2022-01-01T00:00:00'), ('data2', '2022-02-01T00:00:00')],
            schema=['data', 'ts_column'])
        self.log.info('exiting init()')

    def get_count(self):
        return self.df.count()
test_my_spark_module.py

import pytest
from pyspark.sql import SparkSession

from my_spark_module import MySparkModule


@pytest.fixture(autouse=True, scope='module')
def spark_session():
    return SparkSession.builder.getOrCreate()


class TestMySparkModule:
    def test_tables(self, spark_session):
        # Use the injected spark_session fixture instead of calling getOrCreate() again
        log = spark_session.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.TestMySparkModule')
        log.info('something I wanted to log')

        assert MySparkModule(spark_session).get_count() == 2, 'expected count to be 2'
To set up log4j2 logging from your spark session to /tmp/my-spark.log and stdout, place the file below at $SPARK_HOME/conf/log4j2.properties.
We use a poetry virtual env, so SPARK_HOME was <project-root>/.venv/lib/python3.10/site-packages/pyspark/ and it had no conf folder. So I created the conf folder and added <project-root>/.venv/lib/python3.10/site-packages/pyspark/conf/log4j2.properties. Any spark session started using <project-root>/.venv/bin/pyspark will use this log4j2 config.
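If you're not sure where the pyspark package (and hence the venv's SPARK_HOME) lives, a quick sanity check is to ask Python where pyspark is installed. A minimal sketch, assuming pyspark is importable from the same env you run pytest in:

import os
import pyspark

# Directory of the installed pyspark package, e.g.
# <project-root>/.venv/lib/python3.10/site-packages/pyspark
pyspark_home = os.path.dirname(pyspark.__file__)
conf_dir = os.path.join(pyspark_home, 'conf')   # create this folder if it's missing

print('pyspark dir:', pyspark_home)
print('conf exists:', os.path.isdir(conf_dir))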
Change logger.spark_loggers.level to trace, debug or info to change the log level of Spark's own loggers and get VERY detailed logs.
NOTE the comments inside the file.
# Reload config every X seconds, for testing out changes to
# this file without quitting your spark console session
# monitorInterval = 3 # seconds
property.log_pattern = %d{MM-dd HH:mm:ss.SSS} %5p [%t-%X{tid}] %c{*.1} - %m%n%exception{full}
appender.my_console.name = my_console
appender.my_console.type = Console
appender.my_console.target = SYSTEM_OUT
appender.my_console.layout.type = PatternLayout
appender.my_console.layout.pattern = ${log_pattern}
appender.my_file.name = my_file
appender.my_file.type = File
appender.my_file.filename = /tmp/my-spark.log
appender.my_file.layout.type = PatternLayout
appender.my_file.layout.pattern = ${log_pattern}
# Uncomment if you want to overwrite the log file every time your program runs
# appender.my_file.Append=false
# For deploying Spark ThriftServer
# SPARK-34128: Suppress undesirable TTransportException warnings involved in THRIFT-4805
appender.my_console.filter.1.type = RegexFilter
appender.my_console.filter.1.regex = .*Thrift error occurred during processing of message.*
appender.my_console.filter.1.onMatch = deny
appender.my_console.filter.1.onMismatch = neutral
# Set everything to go to the file
rootLogger.level = warn
# Uncomment to also log everything to the console; this will be VERY noisy
# rootLogger.appenderRef.my_console.ref = my_console
rootLogger.appenderRef.my_file.ref = my_file
# update to change spark log level (all loggers with org.apache.spark prefix)
logger.spark_loggers.name = org.apache.spark
logger.spark_loggers.level = warn
logger.spark_loggers.additivity = false
# control loggers with com.kash.* prefix
logger.my_loggers.name = com.kash
logger.my_loggers.level = trace
# logger.my_loggers.appenderRef.my_console.ref = my_console
logger.my_loggers.appenderRef.my_file.ref = my_file
logger.my_loggers.additivity = false
# Set the default spark-shell/spark-sql log level to WARN. When running the
# spark-shell/spark-sql, the log level for these classes is used to overwrite
# the root logger's log level, so that the user can have different defaults
# for the shell and regular Spark apps.
logger.repl.name = org.apache.spark.repl.Main
logger.repl.level = warn
logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
logger.thriftserver.level = warn
# Settings to quiet third party logs that are too verbose
logger.jetty1.name = org.sparkproject.jetty
logger.jetty1.level = warn
logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
logger.jetty2.level = error
logger.replexprTyper.name = org.apache.spark.repl.SparkIMain$exprTyper
logger.replexprTyper.level = info
logger.replSparkILoopInterpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
logger.replSparkILoopInterpreter.level = info
logger.parquet1.name = org.apache.parquet
logger.parquet1.level = error
logger.parquet2.name = parquet
logger.parquet2.level = error
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
logger.RetryingHMSHandler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
logger.RetryingHMSHandler.level = fatal
logger.FunctionRegistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
logger.FunctionRegistry.level = error
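One thing to watch out for with the config above: only loggers whose names start with com.kash (logger.my_loggers.name) are logged at trace; everything else falls through to the root logger at warn, so info messages from loggers named My.* would still be dropped unless you change that prefix to My or add another logger block. A minimal smoke test, assuming the config is in place and using a hypothetical logger name under the com.kash prefix:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# JVM-side logger whose name matches the com.kash prefix configured
# in logger.my_loggers.name above, so info-level messages are kept.
log = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('com.kash.smoke_test')
log.info('this line should end up in /tmp/my-spark.log')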