Tags: python, java, logging, pyspark, log4j

Configuring log4j2 with IDE / pyspark shell to log to console and file using a properties file


In my development env on my laptop (an Ubuntu box), I write my "pyspark code" and test it with pytest. I inject the spark session created by SparkSession.builder.getOrCreate() into my tests. This "pyspark code" has a bunch of log statements at info level, but none of them show up anywhere, so I'm forced to use print(), which I later remove.

All I want is:

  • Ability to set log level for all my loggers (My.*)
  • Write to console and file (appenders)
  • Some docs that describe the format specs: %t, %m, %n, %d{yyyy-MM-dd}, ...

I read:

  • Configuring Logging in the official Spark documentation. It just says to add log4j2.properties. I could not find any docs on the format of log4j2.properties, and the sample log4j2.properties.template is VERY limited. NOTE that it says nothing about using XML or other formats, just properties.
  • All the official log4j2 docs point to the same place, which has no docs on the contents of log4j2.properties. It gives some samples, but those are useless, e.g. where do I specify the filename for a file appender (not a rolling file appender)?
  • I even tried the log4j (not log4j2) docs; that clearly doesn't work (ConfigurationException: No type attribute provided for Appender my_console).

I don't want to figure it out by trial and error. Just looking for docs.

my_spark_module.py

class MySparkModule:
    def __init__(self, spark):
        # grab a JVM-side log4j logger through the py4j gateway
        self.log = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.SparkModule')
        self.df = spark.createDataFrame([('data1', '2022-01-01T00:00:00'), ('data2', '2022-02-01T00:00:00')],
            schema=['data', 'ts_column'])
        self.log.info('exiting init()')

    def get_count(self):
        return self.df.count()

test_my_spark_module.py

import pytest
from pyspark.sql import SparkSession

from my_spark_module import MySparkModule

@pytest.fixture(autouse=True, scope='module')
def spark_session():
    return SparkSession.builder.getOrCreate()


class TestMySparkModule:
    def test_tables(self, spark_session):
        log = spark_session.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('My.TestMySparkModule')
        log.info('something I wanted to log')
        
        assert MySparkModule(spark_session).get_count() == 2, 'expected count to be 2'

Solution

    To set up log4j2 logging from your spark session to /tmp/my-spark.log and stdout, place the file below at $SPARK_HOME/conf/log4j2.properties.

    We use a poetry virtual env, so SPARK_HOME was <project-root>/.venv/lib/python3.10/site-packages/pyspark/, which had no conf folder. So I created the conf folder and added <project-root>/.venv/lib/python3.10/site-packages/pyspark/conf/log4j2.properties.

    Any spark session started using <project-root>/.venv/bin/pyspark will use this log4j2 config.
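
    If you want to find that location programmatically, here is a minimal sketch, assuming a pip/poetry install where the pyspark package directory acts as SPARK_HOME (as it did for us); the printed path is where log4j2.properties should go:

    import os
    import pyspark

    # For pip/poetry installs, the pyspark package directory doubles as SPARK_HOME.
    spark_home = os.path.dirname(pyspark.__file__)
    conf_dir = os.path.join(spark_home, "conf")
    os.makedirs(conf_dir, exist_ok=True)  # conf/ does not exist by default

    # Place the log4j2.properties file shown below at this path.
    print(os.path.join(conf_dir, "log4j2.properties"))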

    Change logger.spark_loggers.level = trace|debug|info to change the log level of Spark's own loggers and get VERY detailed logs.
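
    If you only need a quick, temporary change while a session is running, something like this should also work, without touching the file (SparkContext.setLogLevel overrides the configured level for that session):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # Overrides the configured root log level for this session only.
    # Valid values: ALL, TRACE, DEBUG, INFO, WARN, ERROR, FATAL, OFF
    spark.sparkContext.setLogLevel('DEBUG')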

    NOTE the comments inside the file.

    # Reload config every X seconds, for testing out changes to
    # this file without quitting your spark console session
    # monitorInterval = 3  # seconds
    
    property.log_pattern = %d{MM-dd HH:mm:ss.SSS} %5p [%t-%X{tid}] %c{*.1} - %m%n%exception{full}
    
    appender.my_console.name = my_console
    appender.my_console.type = Console
    appender.my_console.target = SYSTEM_OUT
    appender.my_console.layout.type = PatternLayout
    appender.my_console.layout.pattern = ${log_pattern}
    
    appender.my_file.name = my_file
    appender.my_file.type = File
    appender.my_file.filename = /tmp/my-spark.log
    appender.my_file.layout.type = PatternLayout
    appender.my_file.layout.pattern = ${log_pattern}
    # Uncomment if you want to overwrite the log file every time your program runs
    # appender.my_file.Append=false
    
    # For deploying Spark ThriftServer
    # SPARK-34128: Suppress undesirable TTransportException warnings involved in THRIFT-4805
    appender.my_console.filter.1.type = RegexFilter
    appender.my_console.filter.1.regex = .*Thrift error occurred during processing of message.*
    appender.my_console.filter.1.onMatch = deny
    appender.my_console.filter.1.onMismatch = neutral
    
    
    # Send all logs to the file
    rootLogger.level = warn
    # Uncomment if you also want to log everything to the console; it will be VERY noisy
    # rootLogger.appenderRef.my_console.ref = my_console
    rootLogger.appenderRef.my_file.ref = my_file
    
    # update to change Spark's log level (all loggers with the org.apache.spark prefix);
    # with additivity off, they need their own appender ref to reach the file
    logger.spark_loggers.name = org.apache.spark
    logger.spark_loggers.level = warn
    logger.spark_loggers.appenderRef.my_file.ref = my_file
    logger.spark_loggers.additivity = false
    
    # Control loggers with the com.kash.* prefix (change com.kash to your own logger name prefix, e.g. My)
    logger.my_loggers.name = com.kash
    logger.my_loggers.level = trace
    # logger.my_loggers.appenderRef.my_console.ref = my_console
    logger.my_loggers.appenderRef.my_file.ref = my_file
    logger.my_loggers.additivity = false
    
    # Set the default spark-shell/spark-sql log level to WARN. When running the
    # spark-shell/spark-sql, the log level for these classes is used to overwrite
    # the root logger's log level, so that the user can have different defaults
    # for the shell and regular Spark apps.
    logger.repl.name = org.apache.spark.repl.Main
    logger.repl.level = warn
    
    logger.thriftserver.name = org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver
    logger.thriftserver.level = warn
    
    # Settings to quiet third party logs that are too verbose
    logger.jetty1.name = org.sparkproject.jetty
    logger.jetty1.level = warn
    logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
    logger.jetty2.level = error
    logger.replexprTyper.name = org.apache.spark.repl.SparkIMain$exprTyper
    logger.replexprTyper.level = info
    logger.replSparkILoopInterpreter.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
    logger.replSparkILoopInterpreter.level = info
    logger.parquet1.name = org.apache.parquet
    logger.parquet1.level = error
    logger.parquet2.name = parquet
    logger.parquet2.level = error
    
    # SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs in SparkSQL with Hive support
    logger.RetryingHMSHandler.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
    logger.RetryingHMSHandler.level = fatal
    logger.FunctionRegistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
    logger.FunctionRegistry.level = error
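
    To sanity-check the config, a sketch along these lines should work, assuming Spark 3.3+ where log4j2 is active and org.apache.log4j.LogManager is still reachable through the bridge (as in the question's code). The logger name com.kash.Check is just a placeholder; it only needs to fall under the prefix you set in logger.my_loggers.name:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    # The logger name must start with the configured prefix (com.kash here).
    log = spark.sparkContext._jvm.org.apache.log4j.LogManager.getLogger('com.kash.Check')
    log.info('hello from pyspark, this should end up in /tmp/my-spark.log')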