Tags: hadoop, oozie, ambari, bigdata

Using ENV vars in distributed Hadoop cluster


My goal is to run applications on our Hadoop cluster without hard-coding the cluster configuration into each app. To that end, I am trying to put the cluster configuration into environment variables and propagate them to every node in the cluster.

For example, I define:

export HIVE2_JDBC_URL=jdbc:hive2://localhost:10000

to use it like this later on:

beeline -u $HIVE2_JDBC_URL/<db_name> -e "SELECT * FROM <table_name>;"

While this works for this specific use case (on the CLI), it has two big drawbacks:

  • I have to manually update the environment variables on each node whenever they change
  • Oozie workflows cannot read environment variables

Is there a way to use Ambari to retrieve these settings, and can I define my own custom settings that are then available on each node? Is there an approach that also works in Oozie workflows?


Solution

  • You can force "cluster-wide" environment variables via mapred-site.xml and yarn-site.xml -- but I'm not 100% sure which properties must be set in the configuration of the ResourceManager service, every NodeManager service, and/or the client nodes, nor which level overrides (or adds to) which. You will have to do some research and experimentation.

    Look into the documentation for mapred-default.xml and yarn-default.xml (e.g. here and here for Hadoop 2.7.0) for properties such as...

    mapred.child.env
    mapreduce.admin.user.env
    yarn.app.mapreduce.am.env
    yarn.app.mapreduce.am.admin.user.env
    yarn.nodemanager.admin-env
    yarn.nodemanager.env-whitelist
    

    [Edit] Also look into these properties, which have no proper entry in the "default" listings (yet another documentation bug...), and forget about the "mapred.child" stuff:

    mapreduce.map.env 
    mapreduce.reduce.env 
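
    As a minimal sketch of how such properties are written, assuming the HIVE2_JDBC_URL variable from the question and assuming (untested) that the client-side mapred-site.xml is the right place for them: the value is a comma-separated list of VAR=value pairs. Note that localhost would resolve to each worker node itself, so on a real cluster the actual HiveServer2 host name belongs there.

    ```xml
    <!-- mapred-site.xml fragment; a sketch, other existing properties omitted -->
    <configuration>
      <!-- environment for map task containers -->
      <property>
        <name>mapreduce.map.env</name>
        <value>HIVE2_JDBC_URL=jdbc:hive2://localhost:10000</value>
      </property>
      <!-- environment for reduce task containers -->
      <property>
        <name>mapreduce.reduce.env</name>
        <value>HIVE2_JDBC_URL=jdbc:hive2://localhost:10000</value>
      </property>
      <!-- environment for the MapReduce ApplicationMaster -->
      <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HIVE2_JDBC_URL=jdbc:hive2://localhost:10000</value>
      </property>
    </configuration>
    ```

    Whether these client-side values are overridden or complemented by the admin-level properties listed above is exactly the part that needs experimentation.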
    


    For Oozie jobs, there are several ways to set environment variables:

    • Shell actions have an explicit <env-var>VAR=VALUE</env-var> syntax, because shell scripts rely a lot on env. variables
    • all actions that use a "launcher" YARN job (i.e. Java, Pig, Sqoop, Spark, Hive, Hive2, Shell...) can benefit from a
        <property>
          <name>oozie.launcher.xxx.xxx.xxx.env</name><value>****</value>
        </property>
      to override the values in client config files that are mentioned above
    • MapReduce actions are launched directly; there is no "launcher" job, so the property would be set directly as
        <property>
          <name>xxx.xxx.xxx.env</name><value>****</value>
        </property>
    • in addition, the actions defined in the core Workflow schema (i.e. Java, Pig, MapReduce) can use the <global> section to define the property just once
      => alas, the other actions are defined as plug-ins with a distinct XML schema, and do not inherit the Global properties...
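
    The first two options above could look roughly like the following workflow. This is a sketch, not a tested workflow: the workflow name, the script run_query.sh, the class com.example.RunQuery and the parameter hive2JdbcUrl are all hypothetical, and it assumes the launcher runs as a map task, so that mapreduce.map.env is the property to prefix with oozie.launcher.

    ```xml
    <workflow-app name="env-demo" xmlns="uri:oozie:workflow:0.5">
      <start to="shell-query"/>

      <!-- Option 1: Shell action with the explicit env-var element -->
      <action name="shell-query">
        <shell xmlns="uri:oozie:shell-action:0.2">
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <exec>run_query.sh</exec>
          <env-var>HIVE2_JDBC_URL=${hive2JdbcUrl}</env-var>
          <file>run_query.sh#run_query.sh</file>
        </shell>
        <ok to="java-query"/>
        <error to="fail"/>
      </action>

      <!-- Option 2: any launcher-based action, here Java, with an oozie.launcher.* property -->
      <action name="java-query">
        <java>
          <job-tracker>${jobTracker}</job-tracker>
          <name-node>${nameNode}</name-node>
          <configuration>
            <property>
              <name>oozie.launcher.mapreduce.map.env</name>
              <value>HIVE2_JDBC_URL=${hive2JdbcUrl}</value>
            </property>
          </configuration>
          <main-class>com.example.RunQuery</main-class>
        </java>
        <ok to="end"/>
        <error to="fail"/>
      </action>

      <kill name="fail">
        <message>Action failed: ${wf:errorMessage(wf:lastErrorNode())}</message>
      </kill>
      <end name="end"/>
    </workflow-app>
    ```

    With this layout the URL is defined once, as hive2JdbcUrl in the job.properties file submitted with the workflow, instead of being exported on every node.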

    Unfortunately, the documentation for Oozie (e.g. here for Oozie 4.1) is completely silent about the oozie.launcher.* properties; you will have to do some research on Stack Overflow -- in that post, for example.