Search code examples
pentahoetlkettle

Update multiple Excel sheets of one document within one Pentaho Kettle transformation


I am researching standard sample from Pentaho DI package: GetXMLData - Read parent children rows. It reads separately from same XML input parent rows & children rows. I need to do the same and update two different sheets of the same MS Excel Documents.

GetXMLData - Read parent children rows sample

My understanding is that normal way to achieve it is to put first sequence in one transformation file with XML Output or Writer, second to the second one & at the end create job with chain from start, through 1st & 2nd transformations.

My problems are:

  • When I try to chain above sequences I loose content of first updated Excel sheet in the final document;
  • I need to have at the end just one file with either Job or Transformation without dependencies (In case of above proposed scenario I would have 1 KJB job + 2 KTR transformation files).

Questions are:

  • Is it possible to join 2 sequences from above sample with some wait node before starting update 2nd Excel sheet?
  • If above doesn't work: Is it possible to embed transformations to the job instead of referencing them from external files?
  • And extra question: What is better to use: Excel Output or Excel Writer?

=================

UPDATE:

Based on @AlainD proposal I have tried to put Block node in-between. Here is a result:

Looks like Block step can be an option, but somehow it doesn't work as expected with Excel Output / Writers node (or I do something wrong). What I have observed is that Pentaho tries to execute next after Block steps before Excel file is closed properly by the previous step. That leads to one of the following: I either get Excel file with one empty sheet or generated result file is malformed.

My input XML file (from Pentaho distribution) & test playground transformation are: HERE

NOTE: While playing do not forget to remove generated MS Excel files between runs.

Screenshot: Playground

Any suggestions how to fix my transformation?


Solution

  • The pattern goes as follow:

    • read data: 1 row per children, with the parent data in one or more column
    • group the data : 1 row per parent, forget the children, keep the parent data. Transform and save as needed.
    • back from the original data, lookup each row (children) and fetch the parent in the grouped data flow.
    • the result is one row per children and the needed column of the transformed parent. Transform and save as needed.

    It is a pattern, you may want to change the flow, and/or sort to speed up. But it will not lock, nor feed up the memory: the group by and lookup are pretty reliable.

    enter image description here