sql postgresql hbase etl talend

How to avoid inserting duplicate records


I am running my Talend job from Windows Task Scheduler at a 15-minute interval. The job exports data from HBase into PostgreSQL. Each scheduled run re-inserts the records that earlier runs already inserted, so the second run inserts the first run's records again, and so on.

HBase schema -> id int, name string
PostgreSQL schema -> id int, name varchar(100), with an index created on the id column.

Example:

Rows inserted by each scheduled run:

1st schedule       2nd schedule

id   name            id   name

1    abcd            4    bbbb
2    efgh            5    cccc
3    hjkl            6    eeee

My output in PostgreSQL     Expected output:
after scheduling:

id   name                   id      name

1    abcd                    1      abcd
2    efgh                    2      efgh
3    hjkl                    3      hjkl
1    abcd                    4      bbbb
2    efgh                    5      cccc
3    hjkl                    6      eeee
4    bbbb
5    cccc
6    eeee

Thanks in advance!


Solution

  • Use your PostgreSQL target table as a lookup to check for existing data, and insert only the rows whose id is not already present in the target. Your flow should be as below:

    source --> Expression --> Target
    
                Lookup(to check existing data)     
    

    Let me know if you need more assistance with this. It is a quick and easy task.
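
    The lookup idea can be sketched outside Talend as well. The following is only a minimal illustration, not the Talend job itself: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL so that it runs without a server, and the table name `target` and the function name `insert_new_rows` are made up for the example. It assumes, as in the question, that every run receives all rows from the source, including ones already loaded.

    ```python
    import sqlite3


    def insert_new_rows(conn, rows):
        """Insert only the (id, name) rows whose id is not yet in the target table.

        `target` is a hypothetical table name; adjust it to your own schema.
        Returns the number of rows actually inserted.
        """
        # Lookup step: load the ids that already exist in the target table.
        existing = {row[0] for row in conn.execute("SELECT id FROM target")}

        new_rows = []
        for row_id, name in rows:
            if row_id in existing:
                continue  # already loaded by an earlier run, so skip it
            existing.add(row_id)  # also guards against repeats within this batch
            new_rows.append((row_id, name))

        conn.executemany("INSERT INTO target (id, name) VALUES (?, ?)", new_rows)
        conn.commit()
        return len(new_rows)


    if __name__ == "__main__":
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE target (id INTEGER, name TEXT)")

        first_run = [(1, "abcd"), (2, "efgh"), (3, "hjkl")]
        # The second run sees the old rows again plus three new ones.
        second_run = first_run + [(4, "bbbb"), (5, "cccc"), (6, "eeee")]

        insert_new_rows(conn, first_run)
        insert_new_rows(conn, second_run)

        for row in conn.execute("SELECT id, name FROM target ORDER BY id"):
            print(row)
    ```

    If the target is PostgreSQL 9.5 or later and id is meant to be unique, another option is to make the index on id a unique index and let the database skip duplicates with INSERT ... ON CONFLICT (id) DO NOTHING. Whether that fits depends on how your Talend output component issues its inserts.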