The setup below used to generate only the required amount of logging for us on Spark 2.2. However, after moving to Spark 3.3, the log4j.properties file is no longer respected, and a lot of Spark TRACE and DEBUG output is being printed.
I gather this is because Spark moved from log4j 1.x to log4j2. Despite plenty of research, it is still not clear how to configure log4j2 across the driver and all the executors during spark-submit for Spark 3.3.
The command that worked beautifully in Spark 2.2:
spark-submit --conf "spark.executor.extraJavaOptions=-Dlog4j.debug=true" --conf "spark.driver.extraJavaOptions=-Dlog4j.debug=true" --files /home/hadoop/log4j.properties --name app --master yarn --deploy-mode cluster --class a.b.c.Entrypoint /home/hadoop/jars/app.jar
So the question: how do I ship a log4j2 configuration to the driver and executors with spark-submit on Spark 3.3?
Based on the comments, I made the changes below. But I still see a lot of Spark-internal data getting logged, not just my own:
spark-submit --driver-memory 1g --executor-memory 2g --conf "spark.driver.extraJavaOptions=-Dlog4j2.debug=true" --files /home/hadoop/log4j2.properties --master yarn --deploy-mode cluster --class com.a.b.ABC /home/hadoop/jars/spark-1.0-SNAPSHOT.jar
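(For reference, the Spark 2.2 command above also set the executor-side options, and nothing in this command tells log4j2 which file to load. Below is a sketch that wires both in; -Dlog4j2.configurationFile is log4j2's counterpart of the old -Dlog4j.configuration flag, and the bare file name should resolve because YARN localizes --files entries into each container's working directory:)
spark-submit --driver-memory 1g --executor-memory 2g --conf "spark.driver.extraJavaOptions=-Dlog4j2.debug=true -Dlog4j2.configurationFile=log4j2.properties" --conf "spark.executor.extraJavaOptions=-Dlog4j2.debug=true -Dlog4j2.configurationFile=log4j2.properties" --files /home/hadoop/log4j2.properties --master yarn --deploy-mode cluster --class com.a.b.ABC /home/hadoop/jars/spark-1.0-SNAPSHOT.jar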
log4j2.properties
status = warn
name = campV2
appender.console.type = Console
appender.console.name = console
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy-MM-dd HH:mm:ss} %p %c: %m%n%ex
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = console
logger.app.name = com.company1
logger.app.level = debug
logger.app.additivity = false
logger.app.appenderRef.console.ref = console
logger.app2.name = com.company2
logger.app2.level = debug
logger.app2.additivity = false
logger.app2.appenderRef.console.ref = console
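For reference, the logger.app / logger.app2 entries above are meant to match loggers created in the application code, along these lines (a sketch; the package and object names are illustrative):
package com.company1

import org.apache.logging.log4j.LogManager

object Job {
  // The logger name is derived from the class, so it falls under
  // logger.app.name = com.company1 and logs at debug per logger.app.level.
  private val log = LogManager.getLogger(getClass)

  def run(): Unit = {
    log.debug("app-level debug message, routed to the console appender")
  }
}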
The logs that get generated, full of unwanted data:
LogLastModifiedTime:Tue Dec 20 05:52:31 +0000 2022
LogLength:36546
LogContents:
ls -l:
total 20
lrwxrwxrwx 1 yarn yarn 62 Dec 20 05:52 __app__.jar -> /mnt/yarn/usercache/hadoop/filecache/23/spark-1.0-SNAPSHOT.jar
lrwxrwxrwx 1 yarn yarn 58 Dec 20 05:52 __spark_conf__ -> /mnt/yarn/usercache/hadoop/filecache/21/__spark_conf__.zip
lrwxrwxrwx 1 yarn yarn 78 Dec 20 05:52 __spark_libs__ -> /mnt1/yarn/usercache/hadoop/filecache/22/__spark_libs__7763583720024624816.zip
-rw-r--r-- 1 yarn yarn 93 Dec 20 05:52 container_tokens
-rwx------ 1 yarn yarn 646 Dec 20 05:52 default_container_executor.sh
...
...
echo "broken symlinks(find -L . -maxdepth 5 -type l -ls):" 1>>"/var/log/hadoop-yarn/containers/application_1671425963628_0204/container_1671425963628_0204_01_000003/directory.info"
find -L . -maxdepth 5 -type l -ls 1>>"/var/log/hadoop-yarn/containers/application_1671425963628_0204/container_1671425963628_0204_01_000003/directory.info"
echo "Launching container"
exec /bin/bash -c "LD_LIBRARY_PATH=\"/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/docker/usr/lib/hadoop/lib/native:/docker/usr/lib/hadoop-lzo/lib/native:$LD_LIBRARY_PATH\" $JAVA_HOME/bin/java -server -Xmx2048m '-verbose:gc' '-XX:+PrintGCDetails' '-XX:+PrintGCDateStamps' '-XX:OnOutOfMemoryError=kill -9 %p' '-XX:+IgnoreUnrecognizedVMOptions' '--add-opens=java.base/java.lang=ALL-UNNAMED' '--add-opens=java.base/java.lang.invoke=ALL-UNNAMED' '--add-opens=java.base/java.lang.reflect=ALL-UNNAMED' '--add-opens=java.base/java.io=ALL-UNNAMED' '--add-opens=java.base/java.net=ALL-UNNAMED' '--add-opens=java.base/java.nio=ALL-UN
...
...
DEBUG StatusLogger PluginManager 'Lookup' found 16 plugins
DEBUG StatusLogger PluginManager 'Lookup' found 16 plugins
DEBUG StatusLogger Using configurationFactory org.apache.logging.log4j.core.config.ConfigurationFactory$Factory@6bedbc4d
TRACE StatusLogger Trying to find [log4j2-test18b4aac2.properties] using context class loader sun.misc.Launcher$AppClassLoader@18b4aac2.
TRACE StatusLogger Trying to find [log4j2-test18b4aac2.properties] using sun.misc.Launcher$AppClassLoader@18b4aac2 class loader.
Now, with so many unwanted logs being generated, finding my logs is like finding a needle in a haystack. Is there a way to display only my logs and not the Spark-internal ones?
So the question remains: how do I configure log4j2 so that I see only my own loggers?
Setting log4j2.debug=false makes the TRACE logs go away. However, I still see the script outputs:
--conf "spark.driver.extraJavaOptions=-Dlog4j.debug=false -Dlog4j2.debug=false"
echo "Setting up job resources"
ln -sf -- "/mnt/yarn/usercache/hadoop/filecache/3758/__spark_libs__3245215202131718232.zip" "__spark_libs__"
ln -sf -- "/mnt/yarn/usercache/hadoop/filecache/3760/log4j2.properties" "log4j2.properties"
ln -sf -- "/mnt/yarn/usercache/hadoop/filecache/3759/spark-1.0-SNAPSHOT.jar" "__app__.jar"
ln -sf -- "/mnt/yarn/usercache/hadoop/filecache/3757/__spark_conf__.zip" "__spark_conf__"
ln -sf -- "/mnt/yarn/usercache/hadoop/filecache/3756/hudi-defaults.conf" "hudi-defaults.conf"
echo "Copying debugging information"
# Creating copy of launch script
I am not sure how to solve this. (These lines look like output from YARN's own container launch script rather than anything logged through log4j, so presumably no log4j2 configuration can suppress them.)
Finally, after trying several options, the only one that works is the following:
log4j2.properties
--> located at /usr/lib/spark/conf/log4j2.properties
It is disappointing that something that worked beautifully in Spark 2.2 with the flag --files log4j.properties no longer works after upgrading Spark (--files log4j2.properties), and that we have to resort to an ugly fix by editing files on the servers themselves.
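For anyone else stuck here, the workaround amounts to something like this (a sketch, assuming an EMR-style layout where Spark lives under /usr/lib/spark on every node):
# run on every node of the cluster, so both the driver and all executors pick it up
sudo cp /home/hadoop/log4j2.properties /usr/lib/spark/conf/log4j2.properties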
My log4j2.properties looks as follows:
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Set everything to be logged to the console
rootLogger.level = warn
rootLogger.appenderRef.stdout.ref = STDOUT
appender.console.type = Console
appender.console.name = STDOUT
appender.console.target = SYSTEM_OUT
appender.console.layout.type = PatternLayout
appender.console.layout.pattern = %d{yy-MM-dd HH:mm:ss} %p %c{1}: %m%n%ex
logger.pp.name = com.company
logger.pp.level = debug
logger.pp.additivity = false
logger.pp.appenderRef.console.ref = STDOUT
logger.pp1.name = com.company2
logger.pp1.level = debug
logger.pp1.additivity = false
logger.pp1.appenderRef.console.ref = STDOUT
# Settings to quiet third-party logs that are too verbose (e.g. com.amazonaws.services.s3)
logger.jetty.name = org.sparkproject.jetty
logger.jetty.level = warn
logger.jetty2.name = org.sparkproject.jetty.util.component.AbstractLifeCycle
logger.jetty2.level = error
logger.repl1.name = org.apache.spark.repl.SparkIMain$exprTyper
logger.repl1.level = info
logger.repl2.name = org.apache.spark.repl.SparkILoop$SparkILoopInterpreter
logger.repl2.level = info
# Set the default spark-shell log level to WARN. When running the spark-shell, the
# log level for this class is used to overwrite the root logger's log level, so that
# the user can have different defaults for the shell and regular Spark apps.
logger.repl.name = org.apache.spark.repl.Main
logger.repl.level = warn
# SPARK-9183: Settings to avoid annoying messages when looking up nonexistent UDFs
# in SparkSQL with Hive support
logger.metastore.name = org.apache.hadoop.hive.metastore.RetryingHMSHandler
logger.metastore.level = fatal
logger.hive_functionregistry.name = org.apache.hadoop.hive.ql.exec.FunctionRegistry
logger.hive_functionregistry.level = error
# Parquet related logging
logger.parquet.name = org.apache.parquet.CorruptStatistics
logger.parquet.level = error
logger.parquet2.name = parquet.CorruptStatistics
logger.parquet2.level = error
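With this file in place under /usr/lib/spark/conf on every node, the submit command no longer needs the --files flag or any log4j Java options (a sketch based on the earlier command):
spark-submit --driver-memory 1g --executor-memory 2g --master yarn --deploy-mode cluster --class com.a.b.ABC /home/hadoop/jars/spark-1.0-SNAPSHOT.jar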