apache-spark, hadoop, mapreduce, hadoop-yarn

Passing environment variables to YARN containers


We have a complex environment that computes daily tasks with several technologies: Spark, PySpark, Java MapReduce, and Hive.

Recently we integrated a new system that performs dynamic resolution of services at runtime. This system updates an environment variable (dynamically) before a task is initialized.

There is a library that reads this environment variable and does stuff with it (the details are irrelevant). Therefore each task needs this environment variable in its executor/mapper/reducer environment.

Our tasks are managed by the YARN ResourceManager.

To sum up: I want to pass YARN an environment variable that it will expose in all of its containers (the ApplicationMaster and the executors/mappers/reducers).

Things I tried so far:

Spark - I played with:

spark-submit --conf spark.yarn.appMasterEnv.KEY=Value

This actually exposes the environment variable to the ApplicationMaster, but not to the executors, so if a UDF tries to find it, it will fail.

A possible solution for that is to use:

spark.executorEnv.[EnvironmentVariableName]
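
For example (a sketch; MY_VAR, its value, and the application file name are placeholders), setting both properties on spark-submit should expose the variable to the ApplicationMaster and to every executor:

spark-submit \
  --master yarn \
  --conf spark.yarn.appMasterEnv.MY_VAR=value \
  --conf spark.executorEnv.MY_VAR=value \
  my_job.py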

With MapReduce I'm a bit lost; I didn't find a way to pass an environment variable with

hadoop jar

The best I can do is to pass the variable in a conf file and then expose it using Java code. To expose it to the mappers/reducers I used:

mapreduce.map.env and mapreduce.reduce.env
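
That is, roughly the following in the job configuration (a sketch; MY_VAR and its value are placeholders):

<property>
  <name>mapreduce.map.env</name>
  <value>MY_VAR=value</value>
</property>
<property>
  <name>mapreduce.reduce.env</name>
  <value>MY_VAR=value</value>
</property>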

This approach is not good for me because it forces me to modify all my MapReduce jobs.

So I decided to approach it through the YARN containers themselves. However, after a couple of days of experimenting, I got zero results. So my question is: is there a way to tell YARN to initialize its containers with my extra environment variable through spark-submit and hadoop jar?

For instance

hadoop jar -Dyarn.expose.this.variable=value

I would also be happy to accept answers that only solve the MapReduce side, in a way that lets me expose environment variables without altering MapReduce code.


Solution

  • I think you are looking for these:

    • yarn.app.mapreduce.am.env
    • mapreduce.map.env
    • mapreduce.reduce.env

    See their descriptions in https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml

    Specifically, it says that if you set -Dmapreduce.map.env='A=foo', then the environment variable A will be set to "foo" in each map task.

    And those will get passed to YARN containers.
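
    For instance (a sketch; the jar, main class, input/output paths, and MY_VAR are placeholders, and the -D generic options are only picked up if the job's driver uses ToolRunner/GenericOptionsParser):

    hadoop jar my-job.jar com.example.MyJob \
      -Dyarn.app.mapreduce.am.env=MY_VAR=value \
      -Dmapreduce.map.env=MY_VAR=value \
      -Dmapreduce.reduce.env=MY_VAR=value \
      /input /output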

    This approach is not good for me because it forces me to modify all my MapReduce jobs.

    I'm not sure I understand how you'd avoid altering code otherwise. Some library needs to be modified to read the environment, or otherwise-defined properties.

    Recently we integrated a new system that performs dynamic resolution of services at runtime.

    I think I've seen dynamic configuration set up with Zookeeper/Consul/Etcd, but I've not seen YARN-environment-specific things outside of Docker container labels, for example.