Tags: java, join, spark-streaming, apache-kafka-streams

KStream join fires the join function instantly; how to delay it until the end of the window?


As explained in the comprehensive article Crossing the Streams, the outer KStream-KStream join emits an element as soon as it arrives, without waiting for its match in the other KStream. The downside is that a not-joined event is emitted in addition to every joined event.

Can you suggest an alternative way to implement a join of events without duplicating events (as in the outer join) or missing them (as in the inner join)?


As per the same click-view events example:

KStream<String, JsonNode> joinedEventsStream =
    clickEventsStream.outerJoin(
        viewEventsStream,
        // Desired: fire immediately if a match is found, else fire after 2 seconds
        (clickEvent, viewEvent) -> processJoin(clickEvent, viewEvent),
        JoinWindows.of(Duration.ofSeconds(2L)),
        StreamJoined.with(Serdes.String(), jsonSerde, jsonSerde)
    );

Expected results are explained below:

  • a click event arrives 1 sec after the view - Joined events (A,A)
  • a click event arrives 11 sec after the view - Separate events for each, each emitted 2 seconds (the window size) after its arrival: (B,null) (null,B)
  • a view event arrives 1 sec after the click - Joined events (C,C)
  • there is a view event but no click - Not-joined event 2 seconds after its arrival (D,null)
  • there is a click event but no view - Not-joined event 2 seconds after its arrival (null,E)

[Image: expected outer join]


Solution

  • At the moment (Kafka 2.7.0), the behavior is as described in the blog post. This question has come up multiple times already, and we recently created a ticket to change the behavior: https://issues.apache.org/jira/browse/KAFKA-10847

    For now, you could use a downstream stateful operation after the join to buffer records until the window end (or, maybe better, the window close, i.e., window end plus grace period) is reached. This allows you to filter out spurious left/outer join results.
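
    The buffering idea can be sketched without any Kafka dependency. What follows is a minimal, library-free model of the logic only, not a Kafka Streams API: the class `JoinResultBuffer`, its methods `accept` and `advanceTo`, and the `Pair` type are hypothetical names, and it assumes each key occurs at most once per side, as in the click-view example. In a real topology the pending map would presumably be a state store inside a transformer placed after the join, with the time-based release driven by a punctuator.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, library-free model of a post-join buffer (not a Kafka Streams API).
final class JoinResultBuffer {

    // One outer-join result; left or right is null when that side had no match.
    static final class Pair {
        final String key;
        final String left;
        final String right;
        final long timestampMs;

        Pair(String key, String left, String right, long timestampMs) {
            this.key = key;
            this.left = left;
            this.right = right;
            this.timestampMs = timestampMs;
        }

        boolean isJoined() {
            return left != null && right != null;
        }

        @Override
        public String toString() {
            return "(" + left + "," + right + ")";
        }
    }

    // Assumed hold time: window size plus grace period, in milliseconds.
    private final long holdMs;
    // Unmatched results waiting for their window to close, in arrival order.
    private final Map<String, Pair> pending = new LinkedHashMap<>();

    JoinResultBuffer(long holdMs) {
        this.holdMs = holdMs;
    }

    // Feed one join result; returns the results that can be forwarded right away.
    List<Pair> accept(Pair p) {
        List<Pair> out = new ArrayList<>();
        if (p.isJoined()) {
            // A real match arrived, so any buffered unmatched result is spurious.
            pending.remove(p.key + "|L");
            pending.remove(p.key + "|R");
            out.add(p);
        } else {
            // Hold the unmatched result instead of forwarding it immediately.
            String side = (p.left != null) ? "|L" : "|R";
            pending.put(p.key + side, p);
        }
        return out;
    }

    // Release unmatched results whose hold time has elapsed at time nowMs.
    List<Pair> advanceTo(long nowMs) {
        List<Pair> out = new ArrayList<>();
        Iterator<Pair> it = pending.values().iterator();
        while (it.hasNext()) {
            Pair p = it.next();
            if (nowMs - p.timestampMs >= holdMs) {
                out.add(p);
                it.remove();
            }
        }
        return out;
    }
}
```

    With a hold time of 2000 ms, feeding (A,null) and then (A,A) forwards only (A,A), while a lone (D,null) is forwarded once `advanceTo` is called with a time at least 2 seconds after its arrival. Whether to drive the release by stream time or wall-clock time is a design choice: stream time only advances when new records arrive, so a trailing unmatched record may stay buffered until more input shows up.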