Search code examples
sql-servertalendgreenplumcdchawq

How do implement CDC in greenplum?


We have small array of gpdb and pivotal hadoop.We are trying to do CDC Using gpdb.

WE are using SQL server 2012 as external Data source. we have read only access of SQL SERVER from client's END. Hence ,SQL server does not allow to capture CDC mechanism for Data Integration tool.

We have to Implement CDC in greenplum. So that it brings only changed row Instead of Full Load every time. which also burdens SQL server on every schedules of Batch Job.

We tried on Following SQL

This is complete CDC query which decides action of each row

select * from (select *,case when ( a.ID is null) then 'I'
when ( a.ID is not null and  a.SalesOrder = a.SalesOrder and  a.SalesOrderLine = a.SalesOrderLine) then 'N'
when ( a.ID is not null and ( a.SalesOrder != a.SalesOrder or  a.SalesOrderLine != a.SalesOrderLine)) then 'U'
else 'X'
end as increment_action from tablename)

We also tried to understand using Link:

Similarly, I have tried on around dozen of link from Talend. But it is asking for pub/sub mechanism Which only supports when SQL server CDC feature is on. Hence Unable to do it from talend's way.

https://www.pivotalguru.com/?page_id=35

Is there any way to Implement Incremental Load Instead of Full Load in gpdb USING SQL. AND is there any way to convert SQL SERVER DDL syntax into Greenplum.?

Is there step by step documentation or guidance to implement CDC in gpdb ?

Any help on it would be much appreciated ?


Solution

  • Outsourcer does CDC and it is open source. You can either use Outsourcer or review the code to see how it is done and recreate it with your own code.

    Outsourcer Replication

    Source Code