pentaho etl business-intelligence kettle data-integration

Reusing transformations with different data in Pentaho Data Integration (Kettle)

I'm working with Pentaho Kettle (PDI) and I'm trying to set up a flow in which a few transformations work as if they were functions. To be more specific: I've created some transformations that modify a few fields of a CSV file. Each transformation acts on just one field. So the first transformation should modify values only from the first column of the file, the second transformation should work on another column, and so on. Since I spent time building every single transformation, I would like to reuse them in other jobs/transformations that work with the same kind of values. As an example, I've created a transformation that improves the quality of phone numbers (and many others). Here's a "general" idea of a main job:

enter image description here

My problem is how to pass data through the transformations. At the moment, each transformation reads its input from the result rows using the "Get rows from result" step and, once all the modifications are done, writes the rows back using the "Copy rows to result" step. Here is just a sample (of course the real transformations are more complicated than this one).

enter image description here

As you probably know, we have to specify the incoming fields in the "Get rows from result" step, so if I want to use this transformation in another job/transformation that works with a different file, I have to change the schema in that step.

Maybe there's a different, easier way to move the data through the flow. I've also considered using parameters, but I don't know whether it's possible to set them from fields coming from the result rows. That leads to my other question: "are the result rows the only way to return values from a transformation?"

I've also considered executing all the transformations in parallel inside a single transformation, passing each of them just the field it is interested in plus a key, and then combining the individual fields again with a "Merge join" step. This approach also has a synchronization problem. So does anyone know a good way to solve this? I think there must be a standard method for doing all this.
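To make that last idea concrete, here is a rough Python sketch of it (an analogy only, not PDI code). Each "transformation" receives only a key and its one field, and the results are folded back together by key, so the order in which the partial rows arrive does not matter. The helper names `split_field` and `merge_on_key`, the sample rows, and the cleaning rules are all hypothetical, and the sketch assumes the key is unique per row.

```python
def split_field(rows, key, field):
    """Keep only the key and the single field one transformation cares about."""
    return [{key: row[key], field: row[field]} for row in rows]


def merge_on_key(rows, key, *streams):
    """Fold each transformed stream back into the original rows by key.

    Assumes every key value is unique within `rows`.
    """
    merged = {row[key]: dict(row) for row in rows}
    for stream in streams:
        for partial in stream:
            merged[partial[key]].update(partial)
    return [merged[row[key]] for row in rows]


rows = [
    {"id": 1, "phone": " 555 0100 ", "email": "Ann@Example.com"},
    {"id": 2, "phone": "555 0101", "email": "BOB@example.com"},
]

# Two stand-in "transformations", each working on one field only.
phones = [{**r, "phone": r["phone"].replace(" ", "")}
          for r in split_field(rows, "id", "phone")]
emails = [{**r, "email": r["email"].lower()}
          for r in split_field(rows, "id", "email")]

# reversed(...) simulates one stream arriving in a different order.
merged_rows = merge_on_key(rows, "id", reversed(phones), emails)
print(merged_rows)
```

Joining on an explicit key is what removes the dependence on row order; without a key, the streams would have to stay perfectly aligned.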

Solution

The solution to my problem is based on the "Mapping (sub-transformation)" step. Instead of working in a job, we can call all the transformations from inside another transformation, using one "Mapping (sub-transformation)" step for each of them. Here's a sample:

enter image description here

On each step of this kind we have to specify the input fields that we want to modify, and we can pass just those. Here's an example of the "Input" tab of this kind of step:

enter image description here

As you can see, we have to specify each field under the name it has in the main transformation, and we can rename it to match what the sub-transformation expects (in this case the field "phone" becomes "PHONE"). We also have to specify the output fields in the "Output" tab, in the same way as for the inputs.

The sub-transformation looks like this:

enter image description here

To receive the incoming fields you have to use the "Mapping input specification" step, and to send the modified fields back you have to use the "Mapping output specification" step. In the "Mapping input specification" you declare the incoming fields, which will stay the same every time you use this transformation from now on. Any adaptation to these fields should be done outside, in the main transformation, so you can reuse the sub-transformation without changing anything.
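If it helps to think of it in programming terms, the setup behaves roughly like a function with a fixed signature plus a renaming layer at each call site. The Python below is only an analogy and not PDI code: `clean_phone` stands in for the sub-transformation, `call_mapping` for the "Mapping (sub-transformation)" step, and the two dictionaries for the "Input" and "Output" tabs. All of these names and the digit-only cleaning rule are made up for illustration, and the sketch assumes the sub-transformation returns exactly one row per input row, in the same order.

```python
def clean_phone(rows):
    """Stand-in for the sub-transformation: it always expects a 'PHONE' field."""
    for row in rows:
        digits = "".join(ch for ch in row["PHONE"] if ch.isdigit())
        yield {**row, "PHONE": digits}


def call_mapping(rows, sub_transformation, input_map, output_map):
    """Stand-in for the mapping step in the main transformation.

    input_map:  main-transformation field name -> sub-transformation field name
    output_map: sub-transformation field name -> main-transformation field name
    """
    rows = list(rows)
    # Pass only the mapped fields, renamed to what the sub-transformation expects.
    sub_input = [{sub: row[main] for main, sub in input_map.items()} for row in rows]
    result = []
    for row, out in zip(rows, sub_transformation(sub_input)):
        merged = dict(row)
        for sub, main in output_map.items():
            merged[main] = out[sub]
        result.append(merged)
    return result


# Two callers with different schemas reuse the same unchanged sub-transformation.
customers = [{"name": "Ann", "phone": "+39 (02) 123-456"}]
suppliers = [{"company": "Acme", "telephone": "555 0100"}]

cleaned_customers = call_mapping(customers, clean_phone,
                                 {"phone": "PHONE"}, {"PHONE": "phone"})
cleaned_suppliers = call_mapping(suppliers, clean_phone,
                                 {"telephone": "PHONE"}, {"PHONE": "telephone"})
print(cleaned_customers)
print(cleaned_suppliers)
```

Only the two small dictionaries change between callers, which mirrors how only the "Input" and "Output" tabs need editing when the sub-transformation is reused with a different file.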