Search code examples
apache-kafkasynchronizationiotdistributed-systemstream-processing

Synchronize Data From Multiple Data Sources


Our team is trying to build a predictive maintenance system whose task is to look at a set of events and predict whether these events depict a set of known anomalies or not.

We are at the design phase and the current system design is as follows:

  • The events may occur on multiple sources of an IoT system (such as cloud platform, edge devices or any intermediate platforms)
  • The events are pushed by the data sources into a message queueing system (currently we have chosen Apache Kafka).
  • Each data source has its own queue (Kafka Topic).
  • From the queues, the data is consumed by multiple inference engines (which are actually neural networks).
  • Depending upon the feature set, an inference engine will subscribe to multiple Kafka topics and stream data from those topics to continuously output the inference.
  • The overall architecture follows the single-responsibility principle meaning that every component will be separate from each other and run inside a separate Docker container.

Problem:

In order to classify a set of events as an anomaly, the events have to occur in the same time window. e.g. say there are three data sources pushing their respective events into Kafka topics, but due to some reason, the data is not synchronized. So one of the inference engines pulls the latest entries from each of the kafka topics, but the corresponding events in the pulled data do not belong to the same time window (say 1 hour). That will result in invalid predictions due to out-of-sync data.

Question

We need to figure out how can we make sure that the data from all three sources are pushed in-order so that when an inference engine requests entries (say the last 100 entries) from multiple kakfa topics, the corresponding entries in each topic belong to the same time window?


Solution

  • I would suggest KSQL, which is a streaming SQL engine that enables real-time data processing against Apache Kafka. It also provides nice functionality for Windowed Aggregation etc.

    There are 3 ways to define Windows in KSQL:

    hopping windows, tumbling windows, and session windows. Hopping and tumbling windows are time windows, because they're defined by fixed durations they you specify. Session windows are dynamically sized based on incoming data and defined by periods of activity separated by gaps of inactivity.

    In your context, you can use KSQL to query and aggregate the topics of interest using Windowed Joins. For example,

    SELECT t1.id, ...
    FROM topic_1 t1
    INNER JOIN topic_2 t2
    WITHIN 1 HOURS
    ON t1.id = t2.id;