hadoop, apache-flink, flink-streaming, flink-sql

How to understand streaming table in Flink?


It's hard for me to understand the streaming table in Flink. I can understand Hive, which maps a fixed, static data file to a "table", but how does a table built on streaming data take shape?

For example, every second, 5 events with the same structure are sent to a Kafka stream:

{"num":1, "value": "a"} 
{"num":2, "value": "b"}
....

What does the dynamic table built on them look like? Does Flink consume them all, store them somewhere (memory, local file, HDFS, etc.), and then map them to a table? And once the "transformer" finishes processing these 5 events, does it clear the data and refill the "table" with 5 new events?

Any help is appreciated...


Solution

  • These dynamic tables don't necessarily exist anywhere -- a dynamic table is simply an abstraction that may, or may not, be materialized, depending on the needs of the query being executed. For example, a query that does a simple projection

    SELECT a, b FROM events

    can be executed by simply streaming each record through a stateless Flink pipeline.
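
    As a concrete sketch (my own illustration, not from the original answer): the Kafka topic from the question could be declared as a table with DDL like the following. The connector options assume Flink's Kafka SQL connector with the JSON format; the topic name and broker address are placeholders.

    -- Hypothetical DDL: attaches a schema to the Kafka topic.
    CREATE TABLE events (
      num INT,
      `value` STRING
    ) WITH (
      'connector' = 'kafka',
      'topic' = 'events',                                 -- placeholder topic name
      'properties.bootstrap.servers' = 'localhost:9092',  -- placeholder broker
      'format' = 'json',
      'scan.startup.mode' = 'earliest-offset'
    );

    This statement doesn't copy or materialize anything; it only tells Flink how to read the topic, and each query over events then consumes the stream directly.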

    Also, Flink doesn't operate on mini-batches -- it processes each event one at a time. So there's no physical "table", or partial table, anywhere.

    But some queries do require some state, perhaps very little, such as

    SELECT count(*) FROM events

    which needs nothing more than a single counter, while something like

    SELECT key, count(*) FROM events GROUP BY key

    will use Flink's key-partitioned state (a sharded key-value store) to persist the current counter for each key. Different nodes in the cluster will be responsible for handling events for different keys.
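
    To make that concrete, here is a sketch (my addition, assuming events has a key column as in the query above) of the changelog such a grouped count emits. The +I/-U/+U markers are Flink's row kinds for insert, update-before, and update-after; key is backtick-quoted only because it is a reserved word.

    SELECT `key`, count(*) AS cnt FROM events GROUP BY `key`;
    -- if events with keys a, b, a arrive in that order, the result stream is roughly:
    --   +I[a, 1]             -- first event for key a inserts a row
    --   +I[b, 1]             -- first event for key b inserts a row
    --   -U[a, 1]  +U[a, 2]   -- second event for key a updates its row

    The per-key counters behind those rows live in Flink's keyed state, not in any materialized table.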

    Just as "normal" SQL takes one or more tables as input, and produces a table as output, stream SQL takes one or streams as input, and produces a stream as output. For example, the SELECT count(*) FROM events will produce the stream 1 2 3 4 5 ... as its result.

    There are some good introductions to Flink SQL on YouTube: https://www.google.com/search?q=flink+sql+hueske+walther, and there are training materials on GitHub with slides and exercises: https://github.com/ververica/sql-training.