We have a complex environment that runs daily tasks built with several technologies: Spark, PySpark, Java MapReduce, and Hive.
Recently we integrated a new system that performs dynamic service resolution at runtime. This system updates an environment variable (dynamically) before each task is initialized.
There is a library that reads this environment variable and does things with it (the details are irrelevant here). Therefore each task needs this env variable in its executor/mapper/reducer environment.
Our tasks are managed by the YARN ResourceManager.
To sum up: I want to pass YARN environment variables that it will expose in all of its containers (the ApplicationMaster and the executors/mappers/reducers).
Things I tried so far:
Spark - I played with:
spark-submit --conf spark.yarn.appMasterEnv.KEY=Value
This actually exposes the env variable to the ApplicationMaster, but not to the executors, so if a UDF tries to read it, it will fail.
A possible solution for that is to use:
spark.executorEnv.[EnvironmentVariableName]
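Putting the two settings together, a submit command might look like the sketch below. The variable name SERVICE_ENDPOINT, its value, and the application file are all placeholders, not names from the actual environment:

```shell
# Sketch: expose the same variable to both the ApplicationMaster
# (spark.yarn.appMasterEnv.*) and the executors (spark.executorEnv.*).
# SERVICE_ENDPOINT, the URL, and my_job.py are hypothetical placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.yarn.appMasterEnv.SERVICE_ENDPOINT="http://resolver.internal:8080" \
  --conf spark.executorEnv.SERVICE_ENDPOINT="http://resolver.internal:8080" \
  my_job.py
```

Note that both settings are needed: each one covers only one side (driver/AM vs. executors).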
In MapReduce I'm a bit lost; I didn't find a way to pass an environment variable with
hadoop jar
The best I can do is to pass the variable in a conf file and then expose it using Java code. To expose it to the mappers/reducers I used:
mapreduce.map.env / mapreduce.reduce.env
This approach is not good because it forces me to modify all my MapReduce jobs.
So I decided to approach it through YARN containers. However, after a couple of days of experimenting, I got zero results. So my question: is there a way to make YARN initialize its containers with my extra environment variable through spark-submit and hadoop jar?
For instance
hadoop jar -Dyarn.expose.this.variable=value
I'd also be happy to accept answers that only solve the MapReduce case, as long as they let me expose env variables without altering MapReduce code.
I think you are looking for these:
yarn.app.mapreduce.am.env
mapreduce.map.env
mapreduce.reduce.env
Search for their descriptions at https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/mapred-default.xml
Specifically, it says that if you set -Dmapreduce.map.env='A=foo', then it will set the environment variable A to "foo" in each map task.
And those will get passed to YARN containers.
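For example, assuming a variable named SERVICE_ENDPOINT (a hypothetical placeholder, as are the jar, class, and path names), the same job could be submitted with no code changes:

```shell
# Sketch: set the variable in the AM, map, and reduce containers
# via generic -D options. SERVICE_ENDPOINT, my-job.jar, the driver
# class, and the input/output paths are hypothetical placeholders.
hadoop jar my-job.jar com.example.MyDriver \
  -Dyarn.app.mapreduce.am.env="SERVICE_ENDPOINT=http://resolver.internal:8080" \
  -Dmapreduce.map.env="SERVICE_ENDPOINT=http://resolver.internal:8080" \
  -Dmapreduce.reduce.env="SERVICE_ENDPOINT=http://resolver.internal:8080" \
  /input /output
```

One caveat: the -D options are only picked up automatically if the driver parses them via ToolRunner/GenericOptionsParser; otherwise they can still be set cluster-wide in mapred-site.xml.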
This approach is not good because it forces me to modify all my MapReduce jobs
I'm not sure I understand how you'd avoid altering code otherwise. Some library needs to be modified to read the environment or otherwise-defined properties.
Recently we integrated a new system that performs dynamic service resolution at runtime
I think I've seen dynamic configuration set up with Zookeeper/Consul/Etcd, but I've not seen YARN-environment-specific things outside of Docker container labels, for example.