amazon-s3 apache-flink avro flink-streaming

Dynamically consume and sink Kafka topics with Flink


I haven't been able to find much information about this online. Is it possible to build a Flink app that dynamically consumes all Kafka topics matching a regex pattern and syncs those topics to S3? Each dynamically synced topic would contain Avro messages, and the Flink app would use Confluent's Schema Registry.


Solution

  • You're in luck! Flink 1.4 was released just a few days ago, and it is the first version that supports subscribing to Kafka topics by regular expression. According to the Javadoc, this is how you use it:

    FlinkKafkaConsumer011

    public FlinkKafkaConsumer011(Pattern subscriptionPattern, DeserializationSchema<T> valueDeserializer, Properties props)
    

    Creates a new Kafka streaming source consumer for Kafka 0.11.x. Use this constructor to subscribe to multiple topics based on a regular expression pattern. If partition discovery is enabled (by setting a non-negative value for FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS in the properties), topics with names matching the pattern will also be subscribed to as they are created on the fly.

    Parameters:

    subscriptionPattern - The regular expression for a pattern of topic names to subscribe to.

    valueDeserializer - The de-/serializer used to convert between Kafka's byte messages and Flink's objects.

    props - The properties used to configure the Kafka consumer client, and the ZooKeeper client.

    Note that while the Flink streaming application is running, the consumer re-fetches topic and partition metadata from Kafka at the interval set by this consumer property:

    FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS
    

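    This means each consumer re-syncs its metadata, including the list of matching topics, at that interval. As the Javadoc above implies, discovery only happens when this property is set to a non-negative value; it is disabled by default, so you need to set it in the properties you pass to the Flink consumer. Once it is set, a newly created topic that matches the pattern should start being consumed within at most one interval, for example within 5 minutes if you set it to 300000.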
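
As a rough sketch of the two pieces described above, the snippet below builds the consumer properties with a discovery interval and shows which topic names a pattern would select. It uses only the JDK so that it runs without the Flink connector; the actual constructor call appears only in a comment. Several things here are assumptions rather than anything stated in the answer: the topic names, the broker address, the group id, the class and method names, the string value of `KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS` (please check it against your Flink version), and the idea that a topic is selected only when the whole name matches the pattern.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;
import java.util.regex.Pattern;

class RegexTopicSubscriptionSketch {

    // Assumed string value of FlinkKafkaConsumerBase.KEY_PARTITION_DISCOVERY_INTERVAL_MILLIS;
    // prefer referencing the constant itself when the connector is on the classpath.
    static final String DISCOVERY_KEY = "flink.partition-discovery.interval-millis";

    // Builds the Properties object that would be passed to the consumer constructor.
    static Properties consumerProperties(String bootstrapServers, String groupId,
                                         long discoveryIntervalMillis) {
        Properties props = new Properties();
        props.setProperty("bootstrap.servers", bootstrapServers);
        props.setProperty("group.id", groupId);
        // A non-negative interval enables discovery of new topics and partitions.
        props.setProperty(DISCOVERY_KEY, Long.toString(discoveryIntervalMillis));
        return props;
    }

    // Returns the topic names that the pattern matches in full (assumed semantics).
    static List<String> matchingTopics(Pattern pattern, List<String> topics) {
        List<String> matched = new ArrayList<>();
        for (String topic : topics) {
            if (pattern.matcher(topic).matches()) {
                matched.add(topic);
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical naming scheme and connection settings.
        Pattern pattern = Pattern.compile("events-.*");
        Properties props = consumerProperties("localhost:9092", "s3-sync", 60_000L);

        List<String> topics = Arrays.asList("events-orders", "events-users", "audit-log");
        System.out.println(matchingTopics(pattern, topics)); // [events-orders, events-users]
        System.out.println(props.getProperty(DISCOVERY_KEY)); // 60000

        // With the Flink Kafka 0.11 connector available, these would be used as:
        // new FlinkKafkaConsumer011<>(pattern, valueDeserializer, props);
    }
}
```

With a one-minute interval as in this sketch, a topic such as `events-payments` created after the job starts would be expected to be picked up within about a minute, while `audit-log` would never be consumed because it does not match the pattern.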