I came across two configuration properties of the HDFS Sink in the Flume documentation:
hdfs.rollCount Number of events written to file before it rolled (0 = never roll based on number of events)
and
hdfs.batchSize number of events written to file before it is flushed to HDFS
I want to know the difference between these two properties, and the difference between roll and flush as well. They look the same to me.
In the HDFS Sink, rolling means closing the current file and writing subsequent events to a new file. The sink can roll in three ways: by event count (hdfs.rollCount), by elapsed time (hdfs.rollInterval), and by file size (hdfs.rollSize).
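The batch size (hdfs.batchSize) determines how often the sink commits from the channel: it is the number of events taken from the channel in one transaction and written to the current file before they are flushed to HDFS. The file stays open after a flush; it is only closed when it rolls. Batching helps significantly when you are using a file channel. Since each commit removes the event(s) from the channel, fewer commit calls result in less random disk I/O and better throughput.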
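
To make the distinction concrete, here is a minimal sketch of an HDFS sink configuration. The agent, sink, and channel names (a1, k1, c1), the path, and the numbers are placeholders chosen for illustration, not recommendations; only the hdfs.* property names come from the Flume documentation.

```properties
# Hypothetical agent "a1" with sink "k1" reading from channel "c1"
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
# Placeholder path; adjust for your cluster
a1.sinks.k1.hdfs.path = /flume/events

# Flush: take up to 100 events per channel transaction,
# write them to the open file, flush to HDFS, then commit
a1.sinks.k1.hdfs.batchSize = 100

# Roll: close the current file and start a new one after 10000 events
a1.sinks.k1.hdfs.rollCount = 10000
# 0 disables time-based and size-based rolling
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 0
```

With these assumed values, each file would receive about 100 flushes (one per batch of 100 events) before it is closed and a new file is started.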