apache-kafka, hdfs, apache-flink, apache-kafka-connect

Is there a way to limit the size of Avro files when writing from Kafka via the HDFS connector?


Currently we use Flink's FsStateBackend checkpointing and set fileStateSizeThreshold to limit the size of data written to Avro/JSON files on HDFS to 128 MB. We also close files after a certain delay in checkpoint actions.

Since we are not using advanced Flink features, in a new project we want to use Kafka streaming with the Kafka Connect HDFS Connector to write messages directly to HDFS (without spinning up Flink).

However, I cannot find any option to limit the file size of the HDFS files written by the Kafka connector, except perhaps flush.size, which seems to limit the number of records.

If there are no such settings on the connector, how do people manage file sizes for streaming data on HDFS in some other way?


Solution

  • There is no file size option, only time-based rotation and flush size. You can set a large flush size that you never expect to reach, and then time-based rotation will do a best-effort partitioning of large files into date partitions (we've been able to get 4 GB output files per topic partition within an hourly directory from Connect); see the config sketch after this answer.

    Personally, I suggest additional tools such as Hive, Pig, DistCp, or Flink/Spark, depending on what's available (and not all at once), running in an Oozie job to "compact" these streaming files into larger files; a small compaction sketch is included after this answer.

    See my comment here


    Before Connect, there was Camus, which is now Apache Gobblin. That project offers compaction and late-event processing, plus Hive table creation.


    The general answer here is that you have a designated "hot landing zone" for streaming data and then periodically archive or "freeze" it (which brings up technology names like Amazon Glacier/Snowball and Snowplow).
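
For reference, a minimal sketch of what the "large flush.size plus time-based rotation" approach could look like, submitted through the Kafka Connect REST API. The connector name, topic, URLs, and threshold values are hypothetical, and the property names assume the Confluent HDFS sink connector:

```python
# Sketch: register an HDFS sink connector with a deliberately large flush.size
# and hourly time-based rotation/partitioning. Names and values are examples only.
import json
import requests

connector = {
    "name": "hdfs-sink-example",                      # hypothetical connector name
    "config": {
        "connector.class": "io.confluent.connect.hdfs.HdfsSinkConnector",
        "topics": "my-topic",                         # hypothetical topic
        "hdfs.url": "hdfs://namenode:8020",           # hypothetical HDFS URL
        "format.class": "io.confluent.connect.hdfs.avro.AvroFormat",

        # Record count you never expect to reach, so rotation is time-driven
        "flush.size": "100000000",

        # Rotate open files on a wall-clock schedule (ms): here, once per hour
        "rotate.schedule.interval.ms": "3600000",

        # Write into hourly directories, e.g. .../year=.../month=.../day=.../hour=...
        "partitioner.class": "io.confluent.connect.storage.partitioner.TimeBasedPartitioner",
        "partition.duration.ms": "3600000",
        "path.format": "'year'=YYYY/'month'=MM/'day'=dd/'hour'=HH",
        "locale": "en-US",
        "timezone": "UTC",
    },
}

# The Connect worker's REST endpoint (assumed here to be on localhost:8083)
resp = requests.post(
    "http://localhost:8083/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector),
)
resp.raise_for_status()
print(resp.json())
```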
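And a rough sketch of the periodic "compact small streaming files into larger ones" step mentioned above, written here with PySpark rather than Hive/Pig/Oozie. The paths and target file count are made up, and reading Avro assumes the spark-avro package is available (Spark 2.4+ exposes it as the built-in "avro" format):

```python
# Sketch: compact the many small Avro files in one hourly landing directory
# into a handful of larger files in a separate "compacted" location.
# Paths and file counts are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hourly-avro-compaction").getOrCreate()

# Hypothetical landing and output directories for one hour of one topic
landing_dir = "hdfs://namenode:8020/topics/my-topic/year=2018/month=06/day=01/hour=13"
compacted_dir = "hdfs://namenode:8020/compacted/my-topic/year=2018/month=06/day=01/hour=13"

df = spark.read.format("avro").load(landing_dir)

# Rewrite into a small, fixed number of larger files (tune to hit ~128 MB each)
df.coalesce(4).write.format("avro").mode("overwrite").save(compacted_dir)

spark.stop()
```

Run this on a schedule (e.g. from Oozie or cron) once an hourly directory is closed, and point downstream readers at the compacted location rather than the hot landing zone.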