Do not understand Filter Expressions on Lookup in TMap

I have a very simple job with 4 'contact' records, of which 2 of them have an identical email address

Now I try to find the records have an identical email record. So I load the contact records twice, then both attach them to a tmap, and use a lookup to match on emailaddress. Using filter expressions, I ensure that I don't compare records with themselves.

The result now is that only 1 of the duplicate emails is marked as 'duplicate' and the other records is NOT matched. Does anybody have an idea why?

Solution

This is because :

The Unique match option functions as a Last match. The First match and All matches options function as named.

So if we remove the input filter row1.id!=row2.id and just left join the 2 flows and show them, we will get:

|=-+------------------+----+-----------------=|
|id|mail              |id_1|mail_1            |
|=-+------------------+----+-----------------=|
|c1|[email protected]     |c1  |[email protected]     |
|c2|[email protected]    |c2  |[email protected]    |
|c3|[email protected]|c4  |[email protected]|
|c4|[email protected]|c4  |[email protected]|
'--+------------------+----+------------------'

Note that last 2 rows of the lookup flow does not have the row c3, because Talend fetched the last row that match [email protected] which is c4. Now if we filter that by row1.id!=row2.id we will get only the third row which is what you have got:

|=-+------------------+-----------=|
|id|mail              |isDuplicated|
|=-+------------------+-----------=|
|c1|[email protected]     |false       |
|c2|[email protected]    |false       |
|c3|[email protected]|true        |
|c4|[email protected]|false       |
'--+------------------+------------'

What we can do using only one tMap is to obtain all unique mail rows and all occurance of duplicated rows by enabling all match option.

|=-+------------------+----=|
|id|mail              |isDup|
|=-+------------------+----=|
|c1|[email protected]     |false|
|c2|[email protected]    |false|
|c3|[email protected]|false|
|c3|[email protected]|true |
|c4|[email protected]|true |
|c4|[email protected]|false|
'--+------------------+-----'

Then we can filter this output to get duplicated rows in addition to the initial flow, to fill your exact requirement i dont think we are obliged to join this output again like this :

To get this output:

.--+------------------.
|       unique        |
|=-+-----------------=|
|id|mail              |
|=-+-----------------=|
|c1|[email protected]     |
|c2|[email protected]    |
|c3|[email protected]|
|c4|[email protected]|
'--+------------------'

.--+------------------.
|     duplicated      |
|=-+-----------------=|
|id|mail              |
|=-+-----------------=|
|c3|[email protected]|
|c4|[email protected]|
'--+------------------'
.--+------------------+------------.
|           isDuplicated           |
|=-+------------------+-----------=|
|id|mail              |isDuplicated|
|=-+------------------+-----------=|
|c1|[email protected]     |false       |
|c2|[email protected]    |false       |
|c3|[email protected]|true        |
|c4|[email protected]|true        |
'--+------------------+------------'