apache-kafka-streams

How to wait for KTable consumption for join with parallel execution?


When I run the following topology with num.stream.threads: 1, it works fine. With num.stream.threads: 8, however, projekte is apparently processed so fast that the two KTables are not fully consumed before the join, so some projekt records end up without a matching mietobjekt or wirtschaftseinheit. It works flawlessly with GlobalKTables, but I have to use KTables because changes to a mietobjekt or a wirtschaftseinheit must be propagated.

So, how can I 'wait' or 'delay' execution until both KTables have been consumed completely?

I found this example with a custom join processor and transformer implementation, but it seems like overkill: https://github.com/confluentinc/kafka-streams-examples/blob/master/src/test/java/io/confluent/examples/streams/CustomStreamTableJoinIntegrationTest.java

Function { projekte: KStream<String, ProjektEvent> ->
    Function { projektstatus: KStream<String, ProjektStatusEvent> ->
        Function { befunde: KStream<String, ProjektBefundAggregat> ->
            Function { aufgaben: KStream<String, ProjektAufgabeAggregat> ->
                Function { wirtschaftseinheiten: KTable<String, WirtschaftseinheitAggregat> ->
                    Function { durchfuehrungen: KStream<String, ProjektDurchfuehrungAggregat> ->
                        Function { gruppen: KStream<String, ProjektGruppeAggregat> ->
                            Function { mietobjekte: KTable<String, MietobjektAggregat> ->
                                projekte
                                    .leftJoin(wirtschaftseinheiten)
                                    .leftJoin(mietobjekte)
                                    .cogroup { _, current, previous: ProjektAggregat ->
                                        previous.copy(
                                            projekt = current.projekt,
                                            wirtschaftseinheit = current.wirtschaftseinheit,
                                            mietobjekt = current.mietobjekt,
                                            projektErstelltAm = current.projektErstelltAm
                                        )
                                    }
                                    .cogroup(projektstatus.groupByKey()) { _, projektstatusEvent, aggregat -> aggregat + projektstatusEvent }
                                    .cogroup(befunde.groupByKey()) { _, befundAggregat, aggregat -> aggregat + befundAggregat }
                                    .cogroup(aufgaben.groupByKey()) { _, aufgabeAggregat, aggregat -> aggregat + aufgabeAggregat }
                                    .cogroup(durchfuehrungen.groupByKey()) { _, durchfuehrungAggregat, aggregat -> aggregat + durchfuehrungAggregat }
                                    .cogroup(gruppen.groupByKey()) { _, gruppeAggregat, aggregat -> aggregat + gruppeAggregat }
                                    .aggregate({ ProjektAggregat() }, Materialized.`as`(projektStoreSupplier))
                                    .toStream()
                                    .filterNot { _, projektAggregat -> projektAggregat.projekt == null }
                                    .transform({ EventTypeHeaderTransformer() })
                            }
                        }
                    }
                }
            }
        }
    }
}

Solution

  • Processing order between topics is based on timestamps. You can increase max.task.idle.ms to get better guarantees on timestamp synchronization.

    Thus, if you want to "bootstrap" a KTable, you need to ensure that the record timestamps on the "table topic" are smaller than on the "stream topic".
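
As a rough sketch of the max.task.idle.ms suggestion, the setting can be passed along with the other Kafka Streams properties. The helper name streamsProperties and the 5000 ms value are assumptions chosen for illustration, not taken from the question; a suitable value depends on how long the table topics take to deliver data.

```kotlin
import java.util.Properties

// Hypothetical helper: collects the Kafka Streams settings discussed above.
fun streamsProperties(): Properties = Properties().apply {
    // Same thread count as in the failing setup from the question.
    put("num.stream.threads", "8")
    // Let a task stay idle for up to 5 s (illustrative value) while some of
    // its input partitions have no buffered records, so that records from
    // the other partitions are not processed ahead of them.
    put("max.task.idle.ms", "5000")
}
```

With the Spring Cloud Stream Kafka Streams binder, the same keys would typically be set under the binder's configuration section rather than in code. Idling only helps with ordering; the table records still need timestamps earlier than the stream records, as described above.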

    Also check out these talks: