It's clear by now that all steps in a transformation are executed in parallel, and there's no way to change this behavior in Pentaho.
Given that, we have a scenario with a Switch task that checks a specific field (read from a filename) and decides which task (mapping / sub-transformation) will process that file. This is part of a generic flow that, before and after each mapping task, performs some boilerplate work such as updating DB records, sending emails, etc.
The problem is: if we have no "ACCC014" files, that mapping should not be executed. I understand that skipping it is not possible, as all tasks are executed in parallel, so a second problem arises: inside SOME mappings, XML files are created, and even when Pentaho executes such a task with empty data, we can't find a way to avoid the creation of the XML output file.
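Outside Pentaho, the routing behaviour we're after could be sketched like this (all names here, including `ACCC014` as a handler key and the `parse_file_type`/`dispatch` helpers, are purely illustrative, not actual Pentaho API):

```python
# Hypothetical sketch of the desired routing: pick a handler from a field
# parsed out of the filename, and skip the mapping entirely (so no empty
# XML output is ever written) when there are no input rows.

def parse_file_type(filename):
    # e.g. "ACCC014_20240101.txt" -> "ACCC014"
    return filename.split("_")[0]

def process_accc014(rows):
    # placeholder for the real mapping; returns the flag/message contract
    return {"send_email": True, "message": f"processed {len(rows)} rows"}

HANDLERS = {"ACCC014": process_accc014}

def dispatch(filename, rows):
    if not rows:            # no input rows: skip the mapping entirely,
        return None         # so no empty XML output file is created
    handler = HANDLERS.get(parse_file_type(filename))
    return handler(rows) if handler else None
```

This is exactly the conditional execution that a transformation's parallel step model doesn't give us out of the box.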
We thought about moving this switch logic into the job, since in theory jobs run serially, but found no conditional step that could make this kind of distinction.
We also looked at the Metadata Injection step, but we don't believe it's the way to go. Each sub-transformation does quite different work: some update tables, others write files, others move data between different databases. All of them receive a file as input and return a send_email flag and a message string. Nothing else.
Is there a way to do what we want? Or is there no way to reuse part of the logic based on default inputs/outputs?
Edit: adding ACCC014 transformation. Yes, the "Do not create file at start" option is checked.
You can use the Transformation Executor step (http://wiki.pentaho.com/display/EAI/Transformation+Executor) to execute a transformation conditionally. I haven't really used this step myself, though, so I can't say anything about its stability or performance.
Main transformation:
Set up your parent transformation like this:
Regarding the Injector step: in version 5.2, I was not able to get the fields created in the sub-transformation, even though they were defined on the "result rows" tab, so I had to define all these fields in the Injector step instead. I'm not sure whether this is still necessary in the current version.
Possible adjustments for the Transformation Executor:

You'll probably want to change the "Number of rows to send to the transformation" value on the "Row grouping" tab: set it to 0 in order to send all rows at once instead of re-executing the transformation for every N rows.

If you want to read the output of your sub-transformation, select the "This output will contain the result rows after execution" option while creating the hop to the subsequent step:
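My reading of the row-grouping setting, sketched in plain Python for clarity (this is an illustration of the documented semantics, not Pentaho's actual implementation):

```python
# Sketch of "Row grouping" semantics: with group size N > 0 the executor
# re-runs the sub-transformation once per batch of N rows; with N = 0 all
# rows are sent in a single execution.

def execute_in_groups(rows, group_size, run_subtrans):
    if group_size == 0:
        run_subtrans(list(rows))   # everything in one execution
        return 1
    executions = 0
    for i in range(0, len(rows), group_size):
        run_subtrans(rows[i:i + group_size])
        executions += 1
    return executions
```

So with 10 input rows and a group size of 3, the sub-transformation would run 4 times; with a group size of 0, just once.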
Sub-transformation:
The only change you'll probably need here is to replace your Mapping input and Mapping output steps with "Get rows from result" and "Copy rows to result":
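Given the contract described in the question (each mapping takes file rows in and returns only a send_email flag and a message string), the sub-transformation's interface boils down to something like the hypothetical sketch below, where the function argument plays the role of "Get rows from result" and the return value the role of "Copy rows to result":

```python
# Illustrative sketch of the sub-transformation contract: rows in,
# a single (send_email, message) row out. The mapping body is a stub.

def sub_transformation(result_rows):
    # ... mapping-specific work (DB updates, file output, etc.) goes here ...
    return [{"send_email": len(result_rows) > 0,
             "message": f"{len(result_rows)} rows handled"}]
```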
Known issue in 5.2: the Transformation Executor seems to read the output of the sub-transformation not from the "Copy rows to result" step, but from the most recently created step. So if you have added steps to your sub-transformation, remember to re-create the step from which you expect to read the output: select the "Copy rows to result" step, cut it, paste it back, and re-create the hop.