Oozie has a rich set of directives to describe the desired flow of control between tasks. Does it have anything that assists in passing data between those tasks? Or is passing data an exercise left entirely to the user?
Update: I'm using shell actions to invoke Spark, so I need a solution general enough to cover that use case.
To pass data between Oozie workflow actions, make the output of the first action the input of the second. In the example below, the action named workflow2 reads from the directory that the action named workflow1 writes to:
<workflow-app xmlns='uri:oozie:workflow:0.1' name='demo-wf'>
<start to="workflow1" />
<action name="workflow1">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.input.dir</name>
<value>${workflow1_Input}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${workflow1_Output}</value>
</property>
</configuration>
</map-reduce>
<ok to="workflow2" />
<error to="fail" />
</action>
<action name="workflow2">
<map-reduce>
<job-tracker>${jobTracker}</job-tracker>
<name-node>${nameNode}</name-node>
<configuration>
<property>
<name>mapred.input.dir</name>
<value>${workflow1_Output}</value>
</property>
<property>
<name>mapred.output.dir</name>
<value>${workflow2_Output}</value>
</property>
</configuration>
</map-reduce>
<ok to="done" />
<error to="fail" />
</action>
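<!-- Targets of the error and ok transitions used above -->
<kill name="fail">
<message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
</kill>
<end name="done" />
</workflow-app>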
Note that I left out many details of the map-reduce jobs and show only the input and output. When you set up your properties file, you can define the input and output parameters there. Also be aware that MapReduce ignores any input whose name starts with an underscore (_). So when the first MapReduce job completes, its output directory will contain a _SUCCESS file and a _logs directory, and both are ignored as input for the second action.
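For the shell-action case mentioned in the update, the same idea should carry over: hand the shared path to both scripts as arguments. The following is only a sketch under assumptions. The script names run_spark_step1.sh and run_spark_step2.sh and the ${scriptDir} property are hypothetical, and each script is assumed to wrap a spark-submit call that takes its input and output paths as positional arguments. Shell actions may also require a newer workflow schema version than the 0.1 used above.

```xml
<!-- Sketch only: script names and ${scriptDir} are hypothetical placeholders. -->
<action name="workflow1">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>run_spark_step1.sh</exec>
        <!-- First argument: input path, second argument: output path -->
        <argument>${workflow1_Input}</argument>
        <argument>${workflow1_Output}</argument>
        <!-- Ship the script from HDFS into the action's working directory -->
        <file>${scriptDir}/run_spark_step1.sh#run_spark_step1.sh</file>
    </shell>
    <ok to="workflow2" />
    <error to="fail" />
</action>
<action name="workflow2">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <exec>run_spark_step2.sh</exec>
        <!-- The output of workflow1 becomes the input of workflow2 -->
        <argument>${workflow1_Output}</argument>
        <argument>${workflow2_Output}</argument>
        <file>${scriptDir}/run_spark_step2.sh#run_spark_step2.sh</file>
    </shell>
    <ok to="done" />
    <error to="fail" />
</action>
```

With this layout Oozie only controls the order of execution and supplies the path values. The data itself still travels through HDFS, exactly as in the map-reduce version.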