I'm running Spark 3.4.1 tasks scheduled by Airflow 2.6.1 with a SparkSubmitOperator. Spark runs in cluster mode, so I don't get explicit logs from the Spark driver. Instead I get updates from the spark_submit.py job, which polls the Spark driver pod to check whether the job has finished.
The Airflow logs are full of entries like the following:

[2024-07-29, 06:47:33 UTC] {spark_submit.py:523} INFO - 24/07/29 08:47:33 INFO LoggingPodStatusWatcherImpl: Application status for spark-dc8c170895df4383be2c6933606ee764 (phase: Running)
I would like to get rid of these INFO log entries from the spark_submit.py module only (Python module: from airflow.providers.apache.spark.operators.spark_submit import SparkSubmitOperator). I found the following on the Apache Airflow website:

For Airflow 2.6.1: https://airflow.apache.org/docs/apache-airflow/2.6.1/administration-and-deployment/logging-monitoring/logging-tasks.html#advanced-configuration
For Airflow 2.9.3: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/advanced-logging-configuration.html#
And examples: https://airflow.apache.org/docs/apache-airflow/stable/administration-and-deployment/logging-monitoring/advanced-logging-configuration.html#custom-logger-for-operators-hooks-and-tasks
Thanks!
I tried to apply this to my setup and created a log config for the SparkSubmitOperator:
from copy import deepcopy
from pydantic.utils import deep_update
from airflow.config_templates.airflow_local_settings import DEFAULT_LOGGING_CONFIG

LOGGING_CONFIG = deep_update(
    deepcopy(DEFAULT_LOGGING_CONFIG),
    {
        "loggers": {
            "airflow.providers.apache.spark.operators.spark_submit": {
                "handlers": ["task"],
                "level": "WARNING",
                "propagate": True,
            },
        }
    },
)
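To rule out a problem with the config dict itself, I checked the shape outside Airflow with a minimal stand-in (Airflow's defaults omitted; the "task" handler here is just a plain StreamHandler, not Airflow's real task handler):

```python
import logging.config

# Minimal stand-in for the merged config above, only to confirm that
# dictConfig accepts the shape and applies the level to the named logger.
config = {
    "version": 1,
    "disable_existing_loggers": False,
    "handlers": {"task": {"class": "logging.StreamHandler"}},
    "loggers": {
        "airflow.providers.apache.spark.operators.spark_submit": {
            "handlers": ["task"],
            "level": "WARNING",
            "propagate": True,
        },
    },
}
logging.config.dictConfig(config)

lg = logging.getLogger("airflow.providers.apache.spark.operators.spark_submit")
print(lg.getEffectiveLevel() == logging.WARNING)  # True
```

So the config parses fine and the level is applied to the operators logger; the problem must be elsewhere.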
Then I added the following line in airflow.cfg:
...
# Logging class
# Specify the class that will specify the logging configuration
# This class has to be on the python classpath
# Example: logging_config_class = my.path.default_local_settings.LOGGING_CONFIG
logging_config_class = log_conf.LOGGING_CONFIG
...
The files are stored as follows: airflow.cfg in /opt/airflow, log_conf.py in /opt/airflow/config.
I restarted the whole Airflow application (airflow scheduler, airflow UI and Postgres DB running inside individual containers within a Kubernetes pod) and I saw the following log line:
[2024-07-29T09:24:45.193+0200] {logging_config.py:47} INFO - Successfully imported user-defined logging config from log_config.LOGGING_CONFIG
However, the INFO log entries from spark_submit.py still appear, even though I changed the files above and restarted the whole Airflow application.
My question: why do these INFO entries still appear, and how can I suppress them?
I think the module to configure is airflow.providers.apache.spark.hooks.spark_submit, not airflow.providers.apache.spark.operators.spark_submit. This is going by the fact that this test checks that the hook logs a message containing 'LoggingPodStatusWatcherImpl'.
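The reason the original config has no effect is that Python loggers are matched by exact dotted name: the operators and hooks modules are sibling loggers, so a level set on one never applies to the other. A standalone sketch with plain logging (no Airflow needed; root is set to INFO here to mimic Airflow's default) shows that targeting the hooks logger is what filters the INFO records:

```python
import logging.config

# Hypothetical minimal config: same shape as the question's LOGGING_CONFIG,
# but targeting the hooks module, which is where the status-watcher
# lines are logged from according to the test mentioned above.
logging.config.dictConfig({
    "version": 1,
    "disable_existing_loggers": False,
    "root": {"level": "INFO"},
    "loggers": {
        "airflow.providers.apache.spark.hooks.spark_submit": {"level": "WARNING"},
    },
})

hooks = logging.getLogger("airflow.providers.apache.spark.hooks.spark_submit")
operators = logging.getLogger("airflow.providers.apache.spark.operators.spark_submit")

print(hooks.isEnabledFor(logging.INFO))      # False: status lines are filtered
print(operators.isEnabledFor(logging.INFO))  # True: sibling logger is untouched
```

In the real setup, the same change would just be swapping the logger name inside the deep_update call in log_conf.py.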