
Dataproc: configure Spark driver and executor log4j properties


As explained in previous answers, the ideal way to change the verbosity of a Spark cluster is to edit the corresponding log4j.properties. However, on Dataproc Spark runs on YARN, so we have to adjust the global configuration, not just /usr/lib/spark/conf.

Several suggestions:

On Dataproc we have several gcloud commands and properties we can pass during cluster creation (see the documentation). Is it possible to change the log4j.properties under /etc/hadoop/conf by specifying

--properties 'log4j:hadoop.root.logger=WARN,console'

Maybe not, as the docs state:

The --properties command cannot modify configuration files not shown above.
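For comparison, the supported syntax uses one of the file prefixes listed in those docs (e.g. spark: for spark-defaults.conf); a sketch, where my-cluster and the memory value are placeholders:

gcloud dataproc clusters create my-cluster \
    --properties 'spark:spark.executor.memory=4g'

Since log4j.properties is not one of the files in that table, there is no log4j: prefix to target here.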

Another way would be to use a shell script during cluster init and run sed:

# change the log level on each node to WARN
sudo sed -i -- 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/g' \
    /etc/spark/conf/log4j.properties
sudo sed -i -- 's/hadoop.root.logger=INFO,console/hadoop.root.logger=WARN,console/g' \
    /etc/hadoop/conf/log4j.properties

But is this enough, or do we need to change the hadoop.root.logger environment variable as well?
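Putting this together, a minimal initialization-action sketch; the script name set-log-level.sh and the bucket path gs://my-bucket/ are placeholders, and the sed patterns assume the stock INFO defaults shown above:

#!/bin/bash
# set-log-level.sh - runs on every node at cluster creation time
set -euo pipefail

# Spark driver/executor logging: INFO -> WARN
sed -i 's/log4j.rootCategory=INFO, console/log4j.rootCategory=WARN, console/' \
    /etc/spark/conf/log4j.properties

# Hadoop/YARN logging: INFO -> WARN
sed -i 's/hadoop.root.logger=INFO,console/hadoop.root.logger=WARN,console/' \
    /etc/hadoop/conf/log4j.properties

Initialization actions run as root, so the sudo prefix from the interactive commands is unnecessary. Upload the script to Cloud Storage and pass it at cluster creation:

gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/set-log-level.sh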


Solution

  • This answer is outdated as of Q3 2023

    See this doc for the latest information on Dataproc.