apache-flink, flink-streaming, flink-sql

Difference between DataStream and Table API in Apache Flink


I am new to Apache Flink and want to understand the use cases for the DataStream API and the Table API. Please help me understand when to choose the Table API over the DataStream API.

As I understand it, anything that can be done with the Table API can also be done with the DataStream API. How do the two APIs differ?


Solution

  • The Table API is a relational API that unifies batch and stream processing: the same query can run on static batch data or on continuous streaming data. It is similar to SQL. Queries are optimized and translated into DataSet (batch) or DataStream (streaming) programs, so a streaming Table API query is ultimately executed as a DataStream program. You can implement a lot of custom logic in user-defined functions, but the Table API is centered around relational operations (filter, projection, join, aggregation). It is therefore mostly used for ETL/data pipelines and data analytics applications.

    The DataStream API is a more generic API for implementing stream processing applications. Most of the logic is implemented as Java or Scala classes. Process functions expose time and state, which are the fundamental building blocks of any streaming application. In addition to data pipelines and analytics, you can implement event-driven applications with the DataStream API.

    If you can implement the logic with the Table API, go for it: the program will be simpler and more concise. Use the DataStream API if you need more control and have a lot of custom logic. By the way, you can mix and match both APIs, since a DataStream can easily be converted into a Table and vice versa.
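
To make the contrast concrete, here is a minimal plain-Java sketch of the two programming styles. It is an analogy only, not Flink code: `ApiContrast`, `Event`, `totalsPerUser`, and `InactivityDetector` are hypothetical names that do not exist in Flink. The first method mirrors the relational style (filter, group, aggregate). The second keeps explicit per-key state and looks at event timestamps, which is roughly the role that keyed state and time play in a DataStream process function.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Plain-Java analogy only: none of these names are Flink APIs.
class ApiContrast {

    /** Hypothetical event: a user spends some amount at a given time. */
    static final class Event {
        final String user;
        final long amount;
        final long timestampMillis;

        Event(String user, long amount, long timestampMillis) {
            this.user = user;
            this.amount = amount;
            this.timestampMillis = timestampMillis;
        }
    }

    // Relational style (Table API flavour): declare WHAT to compute as
    // filter + group by + aggregate; no state is managed by hand.
    static Map<String, Long> totalsPerUser(List<Event> events, long minAmount) {
        return events.stream()
                .filter(e -> e.amount >= minAmount)
                .collect(Collectors.groupingBy(
                        e -> e.user,
                        TreeMap::new,
                        Collectors.summingLong(e -> e.amount)));
    }

    // Event-at-a-time style (process-function flavour): custom logic that
    // reads and updates per-key state and reasons about time explicitly.
    static final class InactivityDetector {
        private final long gapMillis;
        // Stands in for keyed state: last timestamp seen per user.
        private final Map<String, Long> lastSeen = new HashMap<>();
        private final List<String> alerts = new ArrayList<>();

        InactivityDetector(long gapMillis) {
            this.gapMillis = gapMillis;
        }

        // Called once per event, in arrival order.
        void onEvent(Event e) {
            Long previous = lastSeen.put(e.user, e.timestampMillis);
            if (previous != null && e.timestampMillis - previous > gapMillis) {
                alerts.add(e.user); // user came back after a long pause
            }
        }

        List<String> alerts() {
            return alerts;
        }
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event("alice", 5, 0),
                new Event("alice", 20, 1_000),
                new Event("bob", 15, 2_000),
                new Event("alice", 30, 20_000));

        System.out.println(totalsPerUser(events, 10));

        InactivityDetector detector = new InactivityDetector(10_000);
        for (Event e : events) {
            detector.onEvent(e);
        }
        System.out.println(detector.alerts());
    }
}
```

With the sample events in `main`, the relational method yields `{alice=50, bob=15}` and the detector flags `[alice]`. In Flink itself, the first style would typically be a Table API or SQL query, and the second a process function on a keyed stream.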