Oozie distributed scheduler

I am going through the Oozie documentation, and I understand that it is a distributed workflow scheduler.

Is it only capable of scheduling workflows on the cluster where the Oozie job has been submitted? To rephrase: Oozie can schedule jobs or run scripts on any random node in the cluster, but can it also take an action on the client machine, an edge node, or another cluster (for instance, distcp)?

Solution

Oozie itself is not distributed; the service runs on an "edge node" (a machine that has all the Hadoop libraries and config but does not run jobs or store HDFS files) and uses a database, typically MySQL, to store all job definitions and state.

Oozie coordinators define when and how a workflow must be triggered.

Oozie workflows are Directed Acyclic Graphs (DAGs), i.e. chains of simple steps: some steps can be executed in parallel and the chaining of steps can be conditional, but there are no loops (that's what the A means in DAG).
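To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not Oozie code and does not use any Oozie API; the step names are made up for illustration. It models a workflow as "step -> steps it depends on", groups the steps into waves that could run in parallel, and rejects a definition that contains a loop.

```python
from graphlib import TopologicalSorter, CycleError  # standard library, Python 3.9+

# Hypothetical workflow: each step maps to the set of steps it depends on.
workflow = {
    "import_data": set(),
    "clean_logs": {"import_data"},                  # fork: two independent branches
    "clean_users": {"import_data"},
    "join_report": {"clean_logs", "clean_users"},   # join: waits for both branches
    "send_email": {"join_report"},
}

def execution_waves(steps):
    """Group steps into waves; steps in the same wave do not depend on each other."""
    sorter = TopologicalSorter(steps)
    sorter.prepare()  # raises CycleError if the graph contains a loop
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every step whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)
    return waves

def is_valid_dag(steps):
    """Return True if the step graph has no loop, False otherwise."""
    try:
        execution_waves(steps)
        return True
    except CycleError:
        return False

if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(workflow), start=1):
        print(number, wave)
```

In this toy model the two "clean" steps land in the same wave, which corresponds to a fork/join in a workflow, and a definition where two steps depend on each other is refused outright.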

Some trivial steps (e.g. sending an e-mail) are done directly by Oozie, but everything else is translated into YARN jobs, and YARN then runs these jobs on arbitrary nodes of the cluster. These jobs may or may not be truly "distributed": for example, a Shell Action is translated into a single Mapper that runs the Oozie bootstrap JAR, which runs a shell interpreter, which runs the script provided. In the end that is "parallel processing" with just one process.

Note that a single Oozie service can run jobs on multiple clusters; that is why each workflow must specify a NameNode and a JobTracker (actually a ResourceManager under YARN).
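As a rough illustration of the point above, the sketch below generates the text of a job.properties file for two different target clusters. It assumes the common convention of passing the endpoints as `nameNode` and `jobTracker` properties that the workflow then references; the host names, ports, application path and the helper function itself are hypothetical, so adapt them to your own setup.

```python
# Hypothetical endpoints for two clusters; replace with your own hosts and ports.
CLUSTERS = {
    "cluster_a": {
        "name_node": "hdfs://nn-a.example.com:8020",
        "job_tracker": "rm-a.example.com:8032",  # the ResourceManager address under YARN
    },
    "cluster_b": {
        "name_node": "hdfs://nn-b.example.com:8020",
        "job_tracker": "rm-b.example.com:8032",
    },
}

def job_properties(cluster, app_path):
    """Build job.properties text that points a workflow at one cluster."""
    endpoints = CLUSTERS[cluster]
    lines = [
        "nameNode=" + endpoints["name_node"],
        "jobTracker=" + endpoints["job_tracker"],
        # ${nameNode} is left as-is so that it is resolved when the job is submitted
        "oozie.wf.application.path=${nameNode}" + app_path,
    ]
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # Same workflow definition, two different target clusters.
    print(job_properties("cluster_a", "/user/me/apps/my-workflow"))
    print(job_properties("cluster_b", "/user/me/apps/my-workflow"))
```

The same workflow definition can then be submitted to the one Oozie service with either file, and the resulting jobs run on whichever cluster the properties point to.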

You may want to browse this old but comprehensive tutorial in 14 chapters: http://hadooped.blogspot.fr/2013/06/apache-oozie-part-1-workflow-with-hdfs.html