Search code examples
apache-flinkflink-streamingstream-processing

What is the best way to merge multiple Flink DataStreams?


I'm searching for the best way to merge multiple (>20) Flink streams that represent different origins of events in our system, All have the same type.

List<DataStream<Event>> dataStreams = ...

Where each object is a POJO (an abstract representation obviously)

public class Event implements Serializable {
  public String userId;
  public long eventTimestamp;
  public String eventData;

}

I eventually want to end up with a single stream

DataStream<Event> merged;

There are different ways to manage that: join , coGroup, map/flatMap (using CoGroup) & union. I'm not sure which of them will give me the quickest throughput of the events from the original streams to the merged one. Moreover, Is there an operator that will be used on all streams at once or should I just call on each 2 streams at a time?

I'm looking to get one stream which then will be keyedBy userId field, does that make any difference?

On a side note, the next step is to 'sort' the events (in each window) for each userId by the eventTimestamp to get a chronological order of events per such userId.


Solution

  • If the events have the same type I would surely go with union as it's the simplest form and the easiest one. Also, notice that the union takes vararg as the parameter, which basically means that You can join all the streams in one call.