Search code examples
ssisssis-2008

Remove duplicate in SSIS package with preference over a column data


I have duplicate rows in data coming from excel sheet. In the SSIS package, I am using Sort transformation where sorting is done in ascending order by the primary key column ID. But before removing the duplicates I want to see if the email column has email with my company's domain. If so, I want other rows removed than the one having this type of email addresses. What should I do? Please refer to the image attached below. Image of data I have

In the data above, I want to remove two rows of John where email address are [email protected]. In Maria's case, I want to remove two rows having email addresses [email protected], hence preserving rows having email addresses of the domain mycompany.com. If there are multiple rows for a user having email addresses of the domain mycompany.com, I want to keep any one row with the domain email address.

Suggest please.


Solution

  • you can do that in sql like Kobi showed, that may be easier. But if you prefer in ssis:

    My test data:

    enter image description here

    enter image description here

    Some points:

    Conditional split: First you separate rows with mycompany and those without.

    Sort and non_mycompany sort: sort both output on id and remove duplicates.

    mycompany_multicast: create two copy of rows with mycompany

    Merge join: left join rows without mycompany to rows with mycompany. Note the join order, the purpose is to get rows without mycompany and no matching id in rows with mycompany.

    enter image description here

    Conditional split1: take rows without mycompany and no matching id in rows with mycompany. you can check id from rows with mycompany, if the id is null then the row has no matching in rows with mycompany.

    enter image description here

    union all: union the final result