Right now I have one stream application in SCDF that is pulling data from multiple tables in a database and replicating it to another database. Currently, our goal is to reduce the amount of work that a given stream is doing, so we want to split the stream out into multiple streams and continue replicating the data into the second database.
Are there any recommended design patterns for funneling the processing of these various streams into one?
If I understand this requirement correctly, you'd want to split the ingest piece by DB/Table per App and then merge them all into a single "payload type" for downstream processing.
If you really do want to split the ingest by DB/Table, you can, but consider the trade-offs first. One obvious benefit is granularity: you can update each app independently and in isolation, and perhaps also reuse it elsewhere. Of course, it brings other challenges: maintenance, fixes, and releases for each individual app, to name a few.
That said, you can fan-in data to a single consumer. Here's an example:
foo1 = jdbc | transform | hdfs
foo2 = jdbc > :foo1.jdbc
foo3 = jdbc > :foo1.jdbc
foo4 = jdbc > :foo1.jdbc
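For instance, with the SCDF shell the fan-in setup above could be created along these lines (a sketch; the stream names and the `--deploy` flag are just for illustration):

```shell
# Primary pipeline; deploying it creates the foo1.jdbc destination (topic)
dataflow:> stream create foo1 --definition "jdbc | transform | hdfs" --deploy

# Additional sources fanning in to the same named destination
dataflow:> stream create foo2 --definition "jdbc > :foo1.jdbc" --deploy
dataflow:> stream create foo3 --definition "jdbc > :foo1.jdbc" --deploy
dataflow:> stream create foo4 --definition "jdbc > :foo1.jdbc" --deploy
```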
Here, `foo1` is the primary pipeline, reading data from a particular DB/Table combination. Likewise, `foo2`, `foo3`, and `foo4` could read from other DB/Table combinations. However, these three streams write the consumed data to a named destination, which in this case happens to be `foo1.jdbc` (aka the topic name). This destination is created automatically by SCDF when the `foo1` pipeline is deployed, specifically to connect the `jdbc` and `transform` apps through the `foo1.jdbc` topic.
In summary, we are routing the data from the different tables to land at the same destination, so the downstream app, in this case the `transform` processor, receives the data from all of the tables.
If the correlation of data is important, you can partition the data at the producer by a unique key (e.g., `customer-id = 1001`) at each `jdbc` source, so that context-specific records land at the same `transform` processor instance (assuming you run "n" processor instances for scaled-out processing).
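A minimal sketch of such a partitioned deployment, assuming a hypothetical payload field `customerId` to key on (the property names follow SCDF's `app.<name>.*` / `deployer.<name>.*` deployment-property convention):

```properties
# Partition the output of each jdbc source by a key derived from the payload
app.jdbc.producer.partitionKeyExpression=payload.customerId
app.jdbc.producer.partitionCount=3

# Run "n" transform instances; records with the same key
# are always routed to the same instance
deployer.transform.count=3
```

With this in place, scaling `transform` out does not break per-key ordering, because all records for a given `customerId` are consumed by the same partition.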