Coming from SAS background, I am finding this basic task of deduping a dataset quite a chore with Talend. I am certain that it's to do with the lack of my experience with the tool.
I need to remove duplicates from a target dataset, after inserting the data. I can't seem to run multiple sql statements using tSortRow (MySQL) and I get the error - "You have an error in your SQL syntax". The set of queries work fine on the MySQL Workbench.
With SAS Data Management Studio, I could add a SAS code node and could run this sort procedure with noduplicates modifier. I was trying to do something similar by running the set of SQL queries, through tMySQLRow component.
How do you guys do that? All I need is the talend way to deduplicate the data in the datasource.
I've been able to de-duplicate table data by doing the following within a subjob:
tDBInput --> tAggregate (count by value) --> tFilterRow (where count is greater than 1 for your duplicate fields)
Alternatively, since you're using MySQL, you should be able to remove the duplicates via SQL. Do you have a sample schema of the table?