I'm reading an excel sheet using Kettle and this sheet has three fields: code, description and date.
Example:
1 - description A - 01/JAN/2013
2 - description A - 15/JAN/2013
3 - description A - 04/JAN/2013
4 - description B - 02/JAN/2013
5 - description B - 16/JAN/2013
6 - description B - 11/JAN/2013
What I want is to put an additional field Max date for each row, where max date is the latest date on a group based on description, like this:
Example 2
1 - description A - 01/JAN/2013 - 15/JAN/2013
2 - description A - 15/JAN/2013 - 15/JAN/2013
3 - description A - 04/JAN/2013 - 15/JAN/2013
4 - description B - 02/JAN/2013 - 16/JAN/2013
5 - description B - 16/JAN/2013 - 16/JAN/2013
6 - description B - 11/JAN/2013 - 16/JAN/2013
The group of rows with description A has the latest date 15/JAN/2013 (because of row 2). The group of rows with description B has the latest date 16/JAN/2013 (because of row 5).
This is the perfect job for the row denormalizer step: