Apache Flume sampling rate


Is it possible to specify a sampling rate for Flume so that only a fraction of the records gets written to HDFS? Is there a Flume sink configuration option for this, or do we need to write our own Flume interceptor to do the sampling? I could not find anything about it in the Apache Flume User Guide.


Solution

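  • Not with the HDFS sink settings alone. The option that sounds closest is the batch size, but it only controls how many events are written to a file before it is flushed to HDFS. It does not drop any events:

    # 100 is the default
    hdfs.batchSize = 100

    To keep only a fraction of the records, the events have to be dropped by an interceptor on the source before they reach the channel. As far as I know Flume does not ship a sampling interceptor, so that means writing your own. Whatever batch size you use, also make sure the channel capacity is large enough for it.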
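
For the custom interceptor route raised in the question, the sampling decision itself is small. The following is a minimal sketch of that decision only, not a complete interceptor: `EventSampler`, its `rate` and `seed` parameters, and the `keep()` method are hypothetical names invented for this example. The assumption is that a class implementing `org.apache.flume.interceptor.Interceptor` would call `keep()` from `intercept(Event)` and return `null` for events that should be dropped.

```java
import java.util.Random;

/**
 * Sampling decision that a custom Flume interceptor could delegate to.
 * EventSampler is a hypothetical helper for illustration, not part of Flume.
 */
final class EventSampler {
    private final double rate;   // fraction of events to keep, from 0.0 to 1.0
    private final Random random;

    EventSampler(double rate, long seed) {
        // Written as a negated range check so that NaN is rejected as well.
        if (!(rate >= 0.0 && rate <= 1.0)) {
            throw new IllegalArgumentException("rate must be in [0.0, 1.0], got " + rate);
        }
        this.rate = rate;
        this.random = new Random(seed);
    }

    /** Returns true if the current event should be kept. */
    boolean keep() {
        // nextDouble() is uniform in [0.0, 1.0), so a rate of 0.0 keeps
        // nothing and a rate of 1.0 keeps everything.
        return random.nextDouble() < rate;
    }
}
```

With a rate of 0.1, roughly one event in ten is kept. This is random sampling, so the kept fraction is only approximately equal to the rate over a finite number of events. If an exact "every Nth event" policy is needed, a counter could replace the random draw.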