Search code examples
apache-flinkflink-streamingpyflink

Performance difference between Table- and DataStream-API


Let us assume that I have two operations which I can write easily using both APIs in PyFlink (e.g. a sum of a column over a TumblingWindow). Are there any performance differences when I use the predefined Table-API commands vs manually implementing the count in Python as a ProcessWindowFunction?

To be more precise, I want to compare

table_from_stream \
    .window(Tumble.over(lit(15).minutes)).on(col('time')).alias('w'))
    .groupby(col('w'), col('a'))
    .select(col('w').end, col('a'), col('b').sum)

vs

datastream \
    .key_by('a') \
    .window(TumblingEventTimeWindows.of(Time.minutes(15))) \ 
    .process(MyProcessFunctionThatManuallySums)

Solution

  • The version using the Table API will be more efficient because

    • The job will run entirely in Java, with no need to cross the java/python barrier. Python will only be used to setup the pipeline.
    • A single piece of managed state (per key) will be used to accumulate a running sum, rather than collecting a list of values to add up at the end of each window. OTOH, the datastream job will have to send that possibly very large list to Python.
    • The Table API has a very efficient serializer.
    • Filter pushdown will be used to filter out any columns other than A or B, rather than those columns being serialized and passed through the pipeline.