I'm trying to poll the file size of a temporary Avro file that is being written to HDFS from a Kafka topic, but org.apache.hadoop.fs.FileStatus.getLen() keeps returning 0 bytes while the writer is still open and writing.
I could keep a byte counter at the writer end, but the records are serialized into Avro's binary format, whose length differs from that of the original records, so a counter would only approximate the on-disk size. I'm looking for a more precise solution.
Is there a way to get the size of a still-open HDFS file, either from the HDFS perspective (io.confluent.connect.hdfs.storage.HdfsStorage) or from the file writer perspective (io.confluent.connect.storage.format.RecordWriter)?
In the end I extended the RecordWriter used in the AvroRecordWriterProvider and included a wrapper around the FSDataOutputStream to poll for the current size in the TopicPartitionWriter.
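The wrapper pattern can be sketched roughly as below. This is an illustrative sketch, not the actual fork's code: since Hadoop isn't available here, a plain java.io.OutputStream stands in for the FSDataOutputStream, and the names SizeTrackingOutputStream and size() are made up for the example. (Note that FSDataOutputStream itself exposes getPos(), which reflects the bytes written to the stream so far, even before the block is finalized and visible via getFileStatus().)

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterOutputStream;
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical sketch: in the real writer the wrapped stream would be
// Hadoop's FSDataOutputStream; here any OutputStream stands in so the
// example is self-contained.
class SizeTrackingOutputStream extends FilterOutputStream {
    private long bytesWritten = 0;

    SizeTrackingOutputStream(OutputStream out) {
        super(out);
    }

    @Override
    public void write(int b) throws IOException {
        out.write(b);
        bytesWritten++;
    }

    @Override
    public void write(byte[] b, int off, int len) throws IOException {
        out.write(b, off, len);
        bytesWritten += len;
    }

    /** Bytes handed to the underlying stream so far; pollable while the file is still open. */
    long size() {
        return bytesWritten;
    }
}

class Demo {
    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        SizeTrackingOutputStream tracking = new SizeTrackingOutputStream(sink);
        tracking.write(new byte[]{1, 2, 3, 4, 5});
        System.out.println(tracking.size()); // prints 5
        tracking.close();
    }
}
```

The key point is that the counter lives on the write path itself, so it sees the Avro-encoded byte count exactly, rather than estimating it from record sizes or waiting for HDFS to report a closed block.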
After legal has cleared it I will push the code to a fork and provide a link to all who are interested.