Search code examples
talend

Do not understand Filter Expressions on Lookup in TMap


I have a very simple job with 4 'contact' records, of which 2 of them have an identical email address

mini_Knipsel2.png mini_Knipsel3.png

Now I try to find the records have an identical email record. So I load the contact records twice, then both attach them to a tmap, and use a lookup to match on emailaddress. Using filter expressions, I ensure that I don't compare records with themselves.

mini_Knipsel.png.png

The result now is that only 1 of the duplicate emails is marked as 'duplicate' and the other records is NOT matched. Does anybody have an idea why?

enter image description here


Solution

  • This is because :

    The Unique match option functions as a Last match. The First match and All matches options function as named.

    So if we remove the input filter row1.id!=row2.id and just left join the 2 flows and show them, we will get:

    |=-+------------------+----+-----------------=|
    |id|mail              |id_1|mail_1            |
    |=-+------------------+----+-----------------=|
    |c1|[email protected]     |c1  |[email protected]     |
    |c2|[email protected]    |c2  |[email protected]    |
    |c3|[email protected]|c4  |[email protected]|
    |c4|[email protected]|c4  |[email protected]|
    '--+------------------+----+------------------'
    

    Note that last 2 rows of the lookup flow does not have the row c3, because Talend fetched the last row that match [email protected] which is c4. Now if we filter that by row1.id!=row2.id we will get only the third row which is what you have got:

    |=-+------------------+-----------=|
    |id|mail              |isDuplicated|
    |=-+------------------+-----------=|
    |c1|[email protected]     |false       |
    |c2|[email protected]    |false       |
    |c3|[email protected]|true        |
    |c4|[email protected]|false       |
    '--+------------------+------------'
    

    What we can do using only one tMap is to obtain all unique mail rows and all occurance of duplicated rows by enabling all match option.

    |=-+------------------+----=|
    |id|mail              |isDup|
    |=-+------------------+----=|
    |c1|[email protected]     |false|
    |c2|[email protected]    |false|
    |c3|[email protected]|false|
    |c3|[email protected]|true |
    |c4|[email protected]|true |
    |c4|[email protected]|false|
    '--+------------------+-----'
    

    Then we can filter this output to get duplicated rows in addition to the initial flow, to fill your exact requirement i dont think we are obliged to join this output again like this : enter image description here

    To get this output:

    .--+------------------.
    |       unique        |
    |=-+-----------------=|
    |id|mail              |
    |=-+-----------------=|
    |c1|[email protected]     |
    |c2|[email protected]    |
    |c3|[email protected]|
    |c4|[email protected]|
    '--+------------------'
    
    .--+------------------.
    |     duplicated      |
    |=-+-----------------=|
    |id|mail              |
    |=-+-----------------=|
    |c3|[email protected]|
    |c4|[email protected]|
    '--+------------------'
    .--+------------------+------------.
    |           isDuplicated           |
    |=-+------------------+-----------=|
    |id|mail              |isDuplicated|
    |=-+------------------+-----------=|
    |c1|[email protected]     |false       |
    |c2|[email protected]    |false       |
    |c3|[email protected]|true        |
    |c4|[email protected]|true        |
    '--+------------------+------------'