Search code examples
excelduplicatesuniqueprimary-keytalend

Eliminate duplicates and Insert Unique records having max no. of column values present through Talend


enter image description here

I have an excel file which gets updated on a daily basis i.e the data is always different every time.

I am pulling the data from the Excel sheet into the table using Talend. I have a primary key Company_ID defined in the table.

The error I am facing is that the Excel sheet has few duplicate Company_ID values. It will also pick up more duplicate values in the future as the Excel file will be updated daily.

I want to choose the first record where the Company ID field is 1 and the record doesn't have null in the rest of the columns. Also, for a Company_ID of 3 there is a null value for one column which is ok since it is a unique record for that company_id.

How do I choose a unique row which has maximum no. of column values present ie for eg in the case of Company ID of 1 in Talend ?


Solution

  • tUniqRow is usually the easiest way to handle duplicates. If you are worried that the first row coming to tUniqRow may not be the first row that you want there, you can sort your rows, so they enter tUniqRow in your preferred order:

    DIAGRAM (used components: tFileInputExcel, tJavaRow, tSortRow, tUniqRow, tFilterColumns)

    In your particular case, the tJava could look like this:

    // Code generated according to input schema and output schema
    output_row.company_id = input_row.company_id;
    output_row.name       = input_row.name;
    output_row.et_cetera  = input_row.et_cetera;
    // End of pre-generated code
    
    int i = 0;
    if (input_row.company_id == null) { i++; }
    if (input_row.name       == null) { i++; }
    if (input_row.et_cetera  == null) { i++; }
    output_row.priority = i;