Let us assume that I have an operation that I can write easily using either API in PyFlink (e.g. a sum of a column over a TumblingWindow). Are there any performance differences between using the predefined Table API commands and manually implementing the sum in Python as a ProcessWindowFunction?
To be more precise, I want to compare
table_from_stream \
    .window(Tumble.over(lit(15).minutes).on(col('time')).alias('w')) \
    .group_by(col('w'), col('a')) \
    .select(col('w').end, col('a'), col('b').sum)
vs
datastream \
    .key_by(lambda row: row['a']) \
    .window(TumblingEventTimeWindows.of(Time.minutes(15))) \
    .process(MyProcessFunctionThatManuallySums())
The version using the Table API will be more efficient because