Search code examples
etlkettlepentaho-data-integration

Kettle: How to get max date for each group of rows


I'm reading an excel sheet using Kettle and this sheet has three fields: code, description and date.

Example:

1 - description A - 01/JAN/2013
2 - description A - 15/JAN/2013
3 - description A - 04/JAN/2013
4 - description B - 02/JAN/2013
5 - description B - 16/JAN/2013
6 - description B - 11/JAN/2013

What I want is to put an additional field Max date for each row, where max date is the latest date on a group based on description, like this:

Example 2

1 - description A - 01/JAN/2013 - 15/JAN/2013
2 - description A - 15/JAN/2013 - 15/JAN/2013
3 - description A - 04/JAN/2013 - 15/JAN/2013
4 - description B - 02/JAN/2013 - 16/JAN/2013
5 - description B - 16/JAN/2013 - 16/JAN/2013
6 - description B - 11/JAN/2013 - 16/JAN/2013

The group of rows with description A has the latest date 15/JAN/2013 (because of row 2). The group of rows with description B has the latest date 16/JAN/2013 (because of row 5).


Solution

  • This is the perfect job for the row denormalizer step:

    http://wiki.pentaho.com/display/EAI/Row+denormaliser