Search code examples
mysqltalend

Remove duplicates from a MySQL table using Talend


Coming from SAS background, I am finding this basic task of deduping a dataset quite a chore with Talend. I am certain that it's to do with the lack of my experience with the tool.

I need to remove duplicates from a target dataset, after inserting the data. I can't seem to run multiple sql statements using tSortRow (MySQL) and I get the error - "You have an error in your SQL syntax". The set of queries work fine on the MySQL Workbench.

With SAS Data Management Studio, I could add a SAS code node and could run this sort procedure with noduplicates modifier. I was trying to do something similar by running the set of SQL queries, through tMySQLRow component.

How do you guys do that? All I need is the talend way to deduplicate the data in the datasource.


Solution

  • I've been able to de-duplicate table data by doing the following within a subjob:

    tDBInput --> tAggregate (count by value) --> tFilterRow (where count is greater than 1 for your duplicate fields)

    subjob

    Alternatively, since you're using MySQL, you should be able to remove the duplicates via SQL. Do you have a sample schema of the table?