Tags: java, hdfs, avro, apache-kafka-connect, confluent-platform

Get size of file while writer is still open on HDFS


I'm trying to poll the size of a temporary Avro file that is being written to HDFS from a Kafka topic, but org.apache.hadoop.fs.FileStatus.getLen() keeps returning 0 bytes while the writer is still open and writing.

I could keep a length counter on the writer side, but the data is ultimately converted into a binary format (Avro) whose length differs from that of the original record. The size could be approximated, but I'm looking for a more precise solution.

Is there a way to get the size of a still-open HDFS file from either the HDFS perspective (io.confluent.connect.hdfs.storage.HdfsStorage) or the file writer perspective (io.confluent.connect.storage.format.RecordWriter)?


Solution

  • In the end, I extended the RecordWriter used in the AvroRecordWriterProvider and wrapped the FSDataOutputStream so that the TopicPartitionWriter can poll the current size.

    Once legal has cleared it, I will push the code to a fork and post a link for anyone interested.
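The answer's code was not posted, so the following is only a minimal sketch of the general idea, not the author's implementation. It assumes a plain java.io wrapper that counts bytes as they pass through to the underlying stream; the names CountingOutputStream and getBytesWritten are hypothetical. In a real connector the wrapped stream would presumably be the FSDataOutputStream handed to the Avro writer.

```java
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical wrapper: counts every byte forwarded to the underlying stream.
// Assumes writes and size polls happen on the same thread.
class CountingOutputStream extends FilterOutputStream {
    private long bytesWritten = 0;

    CountingOutputStream(OutputStream delegate) {
        super(delegate);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytesWritten++;
    }

    @Override
    public void write(byte[] buf, int off, int len) throws IOException {
        // Forward the whole range in one call instead of byte by byte,
        // so the bytes are counted exactly once.
        out.write(buf, off, len);
        bytesWritten += len;
    }

    // Number of encoded bytes that have reached the underlying stream so far.
    long getBytesWritten() {
        return bytesWritten;
    }
}
```

Because the count is taken after Avro encoding, it reflects the binary size rather than the size of the original records. It only includes bytes the Avro writer has actually pushed to the stream: Avro's DataFileWriter buffers each block until its sync interval is reached or flush() is called, so the reported size can lag slightly behind the records written. FSDataOutputStream also exposes getPos(), which may serve the same purpose without a custom counter.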