
PyFlink performance compared to Scala


How does PyFlink performance compare to Flink + Scala?

Big picture: the goal is to build a Lambda architecture with a Cold and a Hot tier. The Cold (batch) tier will be implemented with Apache Spark (PySpark). For the Hot (streaming) tier there are two options: Spark Streaming or Flink.

Since Apache Flink does true streaming rather than Spark's micro-batches, I tend to choose Apache Flink. My only concern is the performance of PyFlink. Will it have lower latency than PySpark streaming? Is it slower than Flink code written in Scala? In which cases is it slower?

Thank you in advance!


Solution

  • I implemented something very similar, and from my experience these are a few things to keep in mind:

    1. Performance of the job depends heavily on the kind of code you write. If you run custom UDFs written in Python during extraction, the job will be slower than the same logic written in Scala. The main reason is the cost of converting Python objects to JVM objects and back. The same overhead applies when you use PySpark.
    2. Flink is a true streaming engine, whereas Spark processes micro-batches. If your use case really needs true streaming, go ahead with Flink.

    If you stick to the built-in functions that PyFlink provides, you will not observe any noticeable difference in performance, because no Python UDF conversion is involved.
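
    The conversion cost from point 1 can be illustrated with a rough sketch. This is not PyFlink code and does not reproduce how PyFlink actually talks to the JVM; it only uses the standard library's pickle as a stand-in for "serialize a record, hand it to Python, serialize the result back". The function names (to_upper, run_in_process, run_across_boundary) are hypothetical and chosen for illustration only.

```python
import pickle
import time


def to_upper(value: str) -> str:
    """The per-record logic; the same in both variants."""
    return value.upper()


def run_in_process(rows):
    """Stand-in for a built-in function: records never cross a boundary."""
    return [to_upper(row) for row in rows]


def run_across_boundary(rows):
    """Stand-in for a Python UDF (assumption: pickle mimics the conversion).

    Every record is serialized on the way in and again on the way out,
    which is the extra work a Python UDF adds on top of the logic itself.
    """
    results = []
    for row in rows:
        incoming = pickle.loads(pickle.dumps(row))       # engine -> Python
        outgoing = pickle.dumps(to_upper(incoming))      # Python -> engine
        results.append(pickle.loads(outgoing))
    return results


def timed(fn, rows):
    """Run fn(rows) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(rows)
    return result, time.perf_counter() - start


rows = [f"event-{i}" for i in range(100_000)]
direct, t_direct = timed(run_in_process, rows)
crossed, t_crossed = timed(run_across_boundary, rows)

# Both variants produce the same output; only the elapsed time differs.
print(f"in-process:      {t_direct:.4f}s")
print(f"across boundary: {t_crossed:.4f}s")
```

    The absolute numbers depend on the machine and say nothing about real Flink latency. The point is only that the second variant does strictly more work per record for the same result, which is why it is worth measuring your own UDFs on a realistic workload before deciding.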