I have a very simple job with 4 'contact' records, of which 2 of them have an identical email address
Now I try to find the records have an identical email record. So I load the contact records twice, then both attach them to a tmap, and use a lookup to match on emailaddress. Using filter expressions, I ensure that I don't compare records with themselves.
The result now is that only 1 of the duplicate emails is marked as 'duplicate' and the other records is NOT matched. Does anybody have an idea why?
This is because :
The Unique match option functions as a Last match. The First match and All matches options function as named.
So if we remove the input filter row1.id!=row2.id
and just left join the 2 flows and show them, we will get:
|=-+------------------+----+-----------------=|
|id|mail |id_1|mail_1 |
|=-+------------------+----+-----------------=|
|c1|[email protected] |c1 |[email protected] |
|c2|[email protected] |c2 |[email protected] |
|c3|[email protected]|c4 |[email protected]|
|c4|[email protected]|c4 |[email protected]|
'--+------------------+----+------------------'
Note that last 2 rows of the lookup flow does not have the row c3, because Talend fetched the last row that match [email protected]
which is c4.
Now if we filter that by row1.id!=row2.id
we will get only the third row which is what you have got:
|=-+------------------+-----------=|
|id|mail |isDuplicated|
|=-+------------------+-----------=|
|c1|[email protected] |false |
|c2|[email protected] |false |
|c3|[email protected]|true |
|c4|[email protected]|false |
'--+------------------+------------'
What we can do using only one tMap is to obtain all unique mail rows and all occurance of duplicated rows by enabling all match
option.
|=-+------------------+----=|
|id|mail |isDup|
|=-+------------------+----=|
|c1|[email protected] |false|
|c2|[email protected] |false|
|c3|[email protected]|false|
|c3|[email protected]|true |
|c4|[email protected]|true |
|c4|[email protected]|false|
'--+------------------+-----'
Then we can filter this output to get duplicated rows in addition to the initial flow, to fill your exact requirement i dont think we are obliged to join this output again like this :
To get this output:
.--+------------------.
| unique |
|=-+-----------------=|
|id|mail |
|=-+-----------------=|
|c1|[email protected] |
|c2|[email protected] |
|c3|[email protected]|
|c4|[email protected]|
'--+------------------'
.--+------------------.
| duplicated |
|=-+-----------------=|
|id|mail |
|=-+-----------------=|
|c3|[email protected]|
|c4|[email protected]|
'--+------------------'
.--+------------------+------------.
| isDuplicated |
|=-+------------------+-----------=|
|id|mail |isDuplicated|
|=-+------------------+-----------=|
|c1|[email protected] |false |
|c2|[email protected] |false |
|c3|[email protected]|true |
|c4|[email protected]|true |
'--+------------------+------------'