
What is the best Apache Beam language that supports Google Dataflow?


I have a question that came up while writing Apache Beam code to run on Dataflow.

I originally wrote my pipeline in Python, but I see that Java, Go, and Scio are also among the supported languages.

Is there a language that gives the best performance?

Or one that has more library support?

I'm asking out of personal curiosity, but I found it hard to get a clear summary from the documentation, so I'm asking here. Thank you.


Solution

  • It's a very opinionated question, but I will try to answer from my knowledge and experience.

    Java was the first language released on Beam, with a full feature set (streaming, batch, windowing, ...).

    Python came afterward, with limited features at first that were enriched over time (no streaming, then streaming without windowing, ...). Beam and Dataflow don't process the data in Python, because that would not be efficient. The Python SDK is a wrapper around Java code that does the processing more efficiently, and that's why Python is always behind Java in terms of features.

    The Go SDK is newer and I have never tested it; it spent a long time in alpha, and I never took the time to try it.

    Now, on Dataflow, things have changed, as described here. The v2 engine uses the language only as a description of the pipeline, and the processing is performed in C++.

    So the differences in features may continue to exist for now, but they should disappear one day. The performance will be the same.
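
To illustrate the idea in the answer that the SDK language mainly describes the pipeline while a separate engine executes it, here is a toy sketch in plain Python. `PipelineSpec` and `run` are hypothetical names made up for this example; they are not part of the Apache Beam API, and a real runner is far more involved. The only point is that building the description does no processing, so the choice of describing language matters less than the engine that runs it.

```python
# Toy model only: these names are hypothetical and are NOT the Apache Beam API.
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Tuple


@dataclass
class PipelineSpec:
    """A description of a pipeline: an ordered list of named steps."""
    steps: List[Tuple[str, Callable]] = field(default_factory=list)

    def map(self, fn: Callable) -> "PipelineSpec":
        # Nothing is processed here; the step is only recorded.
        self.steps.append(("map", fn))
        return self

    def filter(self, fn: Callable) -> "PipelineSpec":
        self.steps.append(("filter", fn))
        return self


def run(spec: PipelineSpec, data: Iterable) -> list:
    """Stand-in for a runner: it alone decides how the steps are executed."""
    items = list(data)
    for kind, fn in spec.steps:
        if kind == "map":
            items = [fn(x) for x in items]
        elif kind == "filter":
            items = [x for x in items if fn(x)]
    return items


# Describe the pipeline first, then hand the description to the "runner".
spec = PipelineSpec().map(lambda x: x * 2).filter(lambda x: x > 4)
result = run(spec, [1, 2, 3, 4])
print(result)
```

In this sketch the same `spec` could be handed to a different `run` implementation without changing how the pipeline was written, which is the separation the answer is pointing at.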