What are Talend change data capture's pluses and deltas in a production EAI+big data integration scenario?


Talend appears to offer a unique combination of data integration (including big data), MDM, data services and ESB. It fits well with an architecture I am developing for a concurrent EAI+big data integration problem. The idea is to use Talend's change data capture feature to propagate event data from a source to multiple targets, including applications and data warehouses.

Has this been done in a production setting? If so, what are the pluses and deltas? Thanks.


Solution

  • The change data capture feature relies on database triggers to build a table in your source database recording the changes made to the tracked source tables. Talend will create these triggers and change tables for you automatically, and you can then use the CDC components to read the changes in easily.
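
    As a rough illustration of that mechanism (this is a generic trigger-based CDC sketch, not Talend's actual generated schema; the "customer" table, the change-table layout and the PostgreSQL syntax are all assumptions):

        // Generic trigger-based CDC sketch -- not Talend's generated schema.
        // A change table plus a trigger record every insert/update/delete
        // made to the tracked "customer" table (hypothetical names).
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.Statement;

        public class CdcSetupSketch {
            public static void main(String[] args) throws Exception {
                try (Connection con = DriverManager.getConnection(
                        "jdbc:postgresql://localhost/source_db", "user", "password");
                     Statement st = con.createStatement()) {

                    // Change table: key of the changed row, operation type, timestamp
                    st.execute(
                        "CREATE TABLE customer_cdc (" +
                        "  customer_id BIGINT NOT NULL," +
                        "  change_type CHAR(1) NOT NULL," +          // I / U / D
                        "  changed_at  TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP)");

                    // Trigger function that writes one row per change (PostgreSQL syntax)
                    st.execute(
                        "CREATE FUNCTION record_customer_change() RETURNS trigger AS $$ " +
                        "BEGIN " +
                        "  IF TG_OP = 'DELETE' THEN " +
                        "    INSERT INTO customer_cdc(customer_id, change_type) VALUES (OLD.customer_id, 'D'); " +
                        "  ELSE " +
                        "    INSERT INTO customer_cdc(customer_id, change_type) VALUES (NEW.customer_id, LEFT(TG_OP, 1)); " +
                        "  END IF; " +
                        "  RETURN NULL; " +
                        "END $$ LANGUAGE plpgsql");

                    st.execute(
                        "CREATE TRIGGER customer_cdc_trg " +
                        "AFTER INSERT OR UPDATE OR DELETE ON customer " +
                        "FOR EACH ROW EXECUTE FUNCTION record_customer_change()");
                }
            }
        }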

    I have some experience of using this on a batch basis, with a DI (data integration) job checking the CDC tables at run time and updating downstream systems with any changes. I'm not sure how well it works, if at all, with Talend ESB to make this more real time, as the mechanism is essentially just polling the CDC table rather than waiting for an event. You can of course set your DI job to poll every minute or even every few seconds to make it a pseudo real time process. Some RDBMS (Oracle springs to mind) will let you call a web service on an event, which would allow you to expose this as a data service, but I'm always a little uncomfortable with the idea.
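
    Here is a minimal sketch of that polling pattern, written as plain JDBC rather than a Talend job (the connection details, change-table name and 30-second interval are assumptions; a real DI job would use the dedicated CDC components instead):

        // Poll a CDC change table on a fixed schedule -- "pseudo real time".
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.Timestamp;
        import java.time.Instant;
        import java.util.concurrent.Executors;
        import java.util.concurrent.ScheduledExecutorService;
        import java.util.concurrent.TimeUnit;

        public class CdcPollerSketch {
            private static Timestamp lastSeen = Timestamp.from(Instant.EPOCH);

            public static void main(String[] args) {
                ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
                scheduler.scheduleAtFixedRate(CdcPollerSketch::pollOnce, 0, 30, TimeUnit.SECONDS);
            }

            private static void pollOnce() {
                String sql = "SELECT customer_id, change_type, changed_at "
                           + "FROM customer_cdc WHERE changed_at > ? ORDER BY changed_at";
                try (Connection con = DriverManager.getConnection(
                            "jdbc:postgresql://localhost/source_db", "user", "password");
                     PreparedStatement ps = con.prepareStatement(sql)) {
                    ps.setTimestamp(1, lastSeen);
                    try (ResultSet rs = ps.executeQuery()) {
                        while (rs.next()) {
                            // Propagate the change to downstream systems here
                            System.out.printf("%s on customer %d%n",
                                    rs.getString("change_type"), rs.getLong("customer_id"));
                            lastSeen = rs.getTimestamp("changed_at");
                        }
                    }
                } catch (Exception e) {
                    e.printStackTrace(); // keep the scheduler running after a failed poll
                }
            }
        }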

    I've put a small process using this into production, though not in real time. As said, it does rely on being able to set triggers, create tables, and insert and update data in your source database, which may not be possible in some cases where database changes are strictly controlled.

    The other option at that point is to pull your master source data into a shadow database and use that to populate downstream systems. Keep a hash of each row of the master source in your shadow copy and, at run time, compare a freshly generated hash of each row in the master source against it to keep your shadow master up to date.
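
    A minimal sketch of that hash-comparison approach (the "customer" master table, the "customer_shadow" table and the column names are hypothetical):

        // Compare a freshly generated hash of each master row against the hash
        // stored in the shadow database to detect new or changed rows.
        import java.nio.charset.StandardCharsets;
        import java.security.MessageDigest;
        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;
        import java.util.HashMap;
        import java.util.HexFormat;
        import java.util.Map;

        public class ShadowHashSyncSketch {
            public static void main(String[] args) throws Exception {
                try (Connection master = DriverManager.getConnection(
                             "jdbc:postgresql://localhost/source_db", "user", "password");
                     Connection shadow = DriverManager.getConnection(
                             "jdbc:postgresql://localhost/shadow_db", "user", "password")) {

                    // Load the last known hash for every row held in the shadow copy
                    Map<Long, String> knownHashes = new HashMap<>();
                    try (Statement st = shadow.createStatement();
                         ResultSet rs = st.executeQuery(
                                 "SELECT customer_id, row_hash FROM customer_shadow")) {
                        while (rs.next()) {
                            knownHashes.put(rs.getLong(1), rs.getString(2));
                        }
                    }

                    // Re-hash every master row and compare against the stored hash
                    MessageDigest digest = MessageDigest.getInstance("SHA-256");
                    try (Statement st = master.createStatement();
                         ResultSet rs = st.executeQuery(
                                 "SELECT customer_id, name, email FROM customer")) {
                        while (rs.next()) {
                            long id = rs.getLong("customer_id");
                            String row = rs.getString("name") + "|" + rs.getString("email");
                            String hash = HexFormat.of().formatHex(
                                    digest.digest(row.getBytes(StandardCharsets.UTF_8)));
                            if (!hash.equals(knownHashes.get(id))) {
                                // New or changed row: update the shadow copy and
                                // propagate the change to downstream systems here
                                System.out.println("Change detected for customer " + id);
                            }
                        }
                    }
                }
            }
        }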