Right now I have one stream application in SCDF that is pulling data from multiple tables in a database and replicating it to another database. Currently, our goal is to reduce the amount of work that a given stream is doing, so we want to split the stream out into multiple streams and continue replicating the data into the second database.
Are there any recommended design patterns for funneling the processing of these various streams into one?
If I understand this requirement correctly, you'd want to split the ingest piece by DB/Table per App and then merge them all into a single "payload type" for downstream processing.
If you really do want to split the ingest by DB/Table, you can, but consider the trade-offs first. One obvious benefit is granularity: you can update each app independently and in isolation, and perhaps also reuse it elsewhere. Of course, it brings other challenges: maintenance, fixes, and releases for each individual app, to name a few.
That said, you can fan-in data to a single consumer. Here's an example:
foo1 = jdbc | transform | hdfs
foo2 = jdbc > :foo1.jdbc
foo3 = jdbc > :foo1.jdbc
foo4 = jdbc > :foo1.jdbc
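For instance, with the SCDF shell the fan-in setup above could be created along these lines (a sketch; the stream names and the `--deploy` flag are just for illustration):

```shell
# Primary pipeline; deploying it creates the foo1.jdbc destination (topic)
dataflow:> stream create foo1 --definition "jdbc | transform | hdfs" --deploy

# Additional sources fanning in to the same named destination
dataflow:> stream create foo2 --definition "jdbc > :foo1.jdbc" --deploy
dataflow:> stream create foo3 --definition "jdbc > :foo1.jdbc" --deploy
dataflow:> stream create foo4 --definition "jdbc > :foo1.jdbc" --deploy
```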
Here, `foo1` is the primary pipeline, reading data from a particular DB/Table combination. Likewise, `foo2`, `foo3`, and `foo4` could read from other DB/Table combinations. However, these three streams write the consumed data to a named destination, which in this case happens to be `foo1.jdbc` (aka the topic name). This destination is created automatically by SCDF when the `foo1` pipeline is deployed, specifically to connect the `jdbc` and `transform` apps through the `foo1.jdbc` topic.
In summary, we are routing the data from the different tables to land at the same destination, so the downstream app, in this case the `transform` processor, receives the data from all of the tables.
If the correlation of data is important, you can partition the data at the producer by a unique key (e.g., `customer-id = 1001`) at each `jdbc` source, so that context-specific records land at the same `transform` processor instance (assuming you run "n" processor instances for scaled-out processing).
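A minimal sketch of such a partitioned deployment, assuming a hypothetical payload field `customerId` to key on (the property names follow SCDF's `app.<name>.*` / `deployer.<name>.*` deployment-property convention):

```properties
# Partition the output of each jdbc source by a key derived from the payload
app.jdbc.producer.partitionKeyExpression=payload.customerId
app.jdbc.producer.partitionCount=3

# Run "n" transform instances; records with the same key
# are always routed to the same instance
deployer.transform.count=3
```

With this in place, scaling `transform` out does not break per-key ordering, because all records for a given `customerId` are consumed by the same partition.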