Tags: hadoop, hadoop-yarn, oozie

Oozie - Is there a way to have only a single instance of the java action executing on the entire cluster?


When I look at my logs, I see that my Oozie Java actions are actually running on multiple machines.

I assume that is because they're wrapped inside a MapReduce job. Is this correct?

Is there a way to have only a single instance of the java action executing on the entire cluster?


Solution

  • The Java action runs inside an Oozie "launcher" job, with just one YARN "map" container.

    The trick is that every YARN job requires an application master (AM) container for coordination.
    So you end up with 2 containers, _0001 for the AM and _0002 for the Oozie action, probably on different machines.

    To control the resource allocation for each one, you can set the following Action properties to override your /etc/hadoop/conf/*-site.xml config and/or hard-coded defaults (which are specific to each version and each distro, by the way):

    • oozie.launcher.yarn.app.mapreduce.am.resource.mb
    • oozie.launcher.yarn.app.mapreduce.am.command-opts (to align the max heap size with the global memory max)
    • oozie.launcher.mapreduce.map.memory.mb
    • oozie.launcher.mapreduce.map.java.opts (...)
    • oozie.launcher.mapreduce.job.queuename (in case you've got multiple queues with different priorities)
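
    As a rough sketch of where these properties go, here is what a Java action's configuration block might look like in workflow.xml. The workflow name, node names, main class, memory sizes, heap sizes and queue name are all placeholder assumptions, not recommendations; pick values that fit your own cluster and distro.

```xml
<!-- Hypothetical workflow: names, sizes and the queue are illustrative only. -->
<workflow-app name="single-java-action-wf" xmlns="uri:oozie:workflow:0.5">
  <start to="java-node"/>

  <action name="java-node">
    <java>
      <job-tracker>${jobTracker}</job-tracker>
      <name-node>${nameNode}</name-node>
      <configuration>
        <!-- Container size for the launcher job's application master -->
        <property>
          <name>oozie.launcher.yarn.app.mapreduce.am.resource.mb</name>
          <value>1024</value>
        </property>
        <!-- AM heap, kept below the AM container size above -->
        <property>
          <name>oozie.launcher.yarn.app.mapreduce.am.command-opts</name>
          <value>-Xmx768m</value>
        </property>
        <!-- Container size for the single "map" task that runs the Java action -->
        <property>
          <name>oozie.launcher.mapreduce.map.memory.mb</name>
          <value>2048</value>
        </property>
        <!-- Heap for that task, kept below its container size -->
        <property>
          <name>oozie.launcher.mapreduce.map.java.opts</name>
          <value>-Xmx1536m</value>
        </property>
        <!-- YARN queue for the launcher job (assumed queue name) -->
        <property>
          <name>oozie.launcher.mapreduce.job.queuename</name>
          <value>default</value>
        </property>
      </configuration>
      <!-- Placeholder class; replace with your own main class -->
      <main-class>com.example.MyMain</main-class>
    </java>
    <ok to="end"/>
    <error to="fail"/>
  </action>

  <kill name="fail">
    <message>Java action failed</message>
  </kill>
  <end name="end"/>
</workflow-app>
```

    The heap values are deliberately set lower than the matching container sizes, so the JVM's non-heap overhead still fits inside the memory YARN allocates to the container.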


    Well, actually, the explanation above is not entirely true... On a Hortonworks distro you end up with 2 containers, as expected.
    But with a Cloudera distro, you typically end up with just one container, running both the AM and the action in the same Linux process.

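    And I don't know for certain how they do that. One candidate worth checking is MapReduce "uber" mode (mapreduce.job.ubertask.enable), a generic setting that runs the tasks of a small job inside the AM's JVM, which may be switched on for the launcher job on that distro; or maybe it's a Cloudera-specific feature.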