Search code examples
pentahokettle

Kettle - Filter duplicated rows by field value


I have a single input stream from a csv file. This stream is from a User Table, where the user may be duplicated with different ID's.

I need to clean this stream. The combo of [Name + Surname] always is unique, but I need to make a comparison before selecting one, because I want to keep the information of the most recent and create a new field based on the comparison of other value.

I have a Timestamp that will be used for selecting the row.

This is a example of input stream: enter image description here

And this is a example of the output I want: enter image description here What tool would be the best for this task?

Thanks in advance.


Solution

  • This should work with a bit of sorting and grouping in pentaho kettle.

    1. CSV-File Input with your table (be sure that TimeUpdated is a date-format)
    2. Sort-Step: Sort by Name, Surname, Surname2, TimeUpdated (Asc: No)
    3. Group by-Step: Group by Name, Surname, Surname2; Aggregate: id (first value), TimeUpdated (first value), Status (Concatenate String seperated by (value: +)