Flink - Join same stream in order to filter some events

I have a stream of data that looks like this:

impressionId | id | name | eventType | timestamp

I need to filter (ignore) event of type "click" that don't have a matching 'impressionId' of type 'impression' (so basically ignore clicks event that don't have an impression) and then count how many impressions in total I have and how many clicks I have (for an id/name pair) for a particular time window.

This is how I approached the solution:

[...]
Table eventsTable = tEnv.fromDataStream(eventStreamWithTimeStamp, "impressionId, id, name, eventType, eventTime.rowtime");
tEnv.registerTable("Events", eventsTable);
   

Table clicksTable = eventsTable
      .where("eventType = 'click'")
      .window(Slide.over("24.hour").every("1.minute").on("eventTime").as("minuteWindow"))
      .groupBy("impressionId, id, name, eventType, minuteWindow")
      .select("impressionId as clickImpressionId, eventType as clickEventType, concat(concat(id,'_'), name) as concatClickId, id as clickId, name as clickName, minuteWindow.rowtime as clickMinute");

Table impressionsTable = eventsTable
      .where("eventType = 'impression'")
      .window(Slide.over("24.hour").every("1.minute").on("eventTime").as("minuteWindow"))
      .groupBy("impressionId, id, name, eventType, minuteWindow")
      .select("impressionId as impressionImpressionId, eventType as impressionEventType, concat(concat(id,'_'), name) as concatImpId, id as impId, name as impName, minuteWindow.rowtime as impMinute");

Table filteredClickCount = clicksTable
      .join(impressionsTable, "clickImpressionId = impressionImpressionId && concatClickId = concatImpId && clickMinute = impMinute")
      .window(Slide.over("24.hour").every("1.minute").on("clickMinute").as("minuteWindow"))
      .groupBy("concatClickId, clickMinute")
      .select("concatClickId, concatClickId.count as clickCount, clickMinute as eventTime");
 DataStream<Test3> result = tEnv.toAppendStream(filteredClickCount, Test3.class);
result.print();

What I'm trying to do is simply create two tables, one with clicks and one with impressions, 'inner' join clicks to impressions and the one that are joined means they are the clicks that have a matching impression.

Now this doesn't work and I don't know why!?

the count produced by the last joint table are not correct. It works for the first minute but after that the counts are off by almost double.

I have then tried to modify the last table like this:

Table clickWithMatchingImpression2 = clicksTable
      .join(impressionsTable, "clickImpressionId = impressionImpressionId && concatClickId = concatImpId && clickMinute = impMinute")
      .groupBy("concatClickId, clickMinute")
      .select("concatClickId, concatClickId.count as clickCount, clickMinute as eventTime");

DataStream<Tuple3<Boolean, Tuple3>> result2 = tEnv.toRetractStream(clickWithMatchingImpression2, Test3.class);
    result2.print();

And.... this works !? However I don't know why and I don't know what to do with this DataStream<Tuple3<Boolean, Test3>> format... Flink refuse to use toAppendStream when the table don't have a window. I would like a simply structure with only the final numbers.

1 ) Is my approach correct? Is there an easier way of filtering click that don't have impressions ?

2 ) Why does the counts are not correct in my solution ?

Solution

I am not entirely sure if I understood your use case correctly, an example with some data points would definitely help here.

Let me explain what your code is doing. First the two tables calculate how many clicks/impressions there were in the last 24 hours. For an input

new Event("1", "1", "ABC", "...", 1),
new Event("1", "2", "ABC", "...", 2),
new Event("1", "3", "ABC", "...", 3),
new Event("1", "4", "ABC", "...", 4)

You will get windows (array<eventId>, window_start, window_end, rowtime):

[1], 1969-12-31-01T00:01:00.000, 1970-01-01T00:01:00.000, 1970-01-01T00:00:59.999
[1, 2], 1969-12-31-01T00:02:00.000, 1970-01-01T00:02:00.000, 1970-01-01T00:01:59.999
[1, 2, 3], 1969-12-31-01T00:03:00.000, 1970-01-01T00:03:00.000, 1970-01-01T00:02:59.999
...

Therefore when you group both on id and name you get sth like:

1, '...', '1_ABC', 1, 'ABC', 1970-01-01T00:00:59.999
1, '...', '1_ABC', 1, 'ABC', 1970-01-01T00:01:59.999
1, '...', '1_ABC', 1, 'ABC', 1970-01-01T00:02:59.999
...

which if you group again in 24 hours windows you will count each event with the same id multiple times.

If I understand your use case correctly and you are looking for how many impressions happened in a 1 minute period around an occurrence of a click, an interval join might be what you are looking for. You could implement your case with a following query:

Table clicks = eventsTable
        .where($("eventType").isEqual("click"))
        .select(
                $("impressionId").as("clickImpressionId"),
                concat($("id"), "_", $("name")).as("concatClickId"),
                $("id").as("clickId"),
                $("name").as("clickName"),
                $("eventTime").as("clickEventTime")
        );

Table impressions = eventsTable
        .where($("eventType").isEqual("impression"))
        .select(
                $("impressionId").as("impressionImpressionId"),
                concat($("id"), "_", $("name")).as("concatImpressionId"),
                $("id").as("impressionId"),
                $("name").as("impressionName"),
                $("eventTime").as("impressionEventTime")
        );

Table table = impressions.join(
        clicks,
        $("clickImpressionId").isEqual($("impressionImpressionId"))
                .and(
                        $("clickEventTime").between(
                                $("impressionEventTime").minus(lit(1).minutes()),
                                $("impressionEventTime"))
                ))
        .select($("concatClickId"), $("impressionEventTime"));

table
        .window(Slide.over("24.hour").every("1.minute").on("impressionEventTime").as("minuteWindow"))
        .groupBy($("concatClickId"), $("minuteWindow"))
        .select($("concatClickId"), $("concatClickId").count())
        .execute()
        .print();

As for why Flink sometimes cannot produce append stream, but only retract stream see. Very briefly, if an operation does not work based on a time attribute, there is not single point in time, when the result is "valid". Therefore it must emit stream of changes instead of a single appended value. The first field in the tuple tells you if the record is an insertion(true) or retraction/deletion(false).