I'm trying to configure Flume so that the files it writes are at least close to the HDFS block size, which is 128 MB in my case. This is my current config, which writes about 10 MB per file:
###############################
httpagent.sources = http-source
httpagent.sinks = k1
httpagent.channels = ch3
# Define / Configure Source (multiport seems to support newer "stuff")
###############################
httpagent.sources.http-source.type = org.apache.flume.source.http.HTTPSource
httpagent.sources.http-source.channels = ch3
httpagent.sources.http-source.port = 5140
httpagent.sinks = k1
httpagent.sinks.k1.type = hdfs
httpagent.sinks.k1.channel = ch3
httpagent.sinks.k1.hdfs.path = hdfs://r3608/hadoop/hdfs/data/flumechannel3/0.5/
httpagent.sinks.k1.hdfs.fileType = DataStream
httpagent.sinks.k1.hdfs.writeFormat = Text
httpagent.sinks.k1.hdfs.rollCount = 0
httpagent.sinks.k1.hdfs.batchSize = 10000
httpagent.sinks.k1.hdfs.rollSize = 0
httpagent.sinks.log-sink.channel = memory
httpagent.sinks.log-sink.type = logger
# Channels
###############################
httpagent.channels = ch3
httpagent.channels.ch3.type = memory
httpagent.channels.ch3.capacity = 100000
httpagent.channels.ch3.transactionCapacity = 80000
So the problem is that I can't get it to write files of about 100 MB. I would expect it to write at least around 100 MB per file if I change the config like this:
httpagent.sinks = k1
httpagent.sinks.k1.type = hdfs
httpagent.sinks.k1.channel = ch3
httpagent.sinks.k1.hdfs.path = hdfs://r3608/hadoop/hdfs/data/flumechannel3/0.4test/
httpagent.sinks.k1.hdfs.fileType = DataStream
httpagent.sinks.k1.hdfs.writeFormat = Text
httpagent.sinks.k1.hdfs.rollSize = 100000000
httpagent.sinks.k1.hdfs.rollCount = 0
But then the files get even smaller: it writes files of about 3-8 MB. Since it's not really possible to aggregate files once they are in HDFS, I really want to get these files bigger. Is there something I'm not getting about the rollSize parameter? Or is there some default value that means it will never write files that big?
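You need to override rollInterval and set it to 0 so that the sink never rolls based on a time interval. It defaults to 30 seconds, so the sink closes the current file every 30 seconds no matter what rollSize is set to: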
httpagent.sinks.k1.hdfs.rollInterval = 0
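Put together, the roll settings for the sink might look like the sketch below. It reuses the agent and sink names from the question; the 134217728 value is just 128 MB expressed in bytes (rollSize is measured in bytes), and the idleTimeout line is an optional assumption of mine, not something the question asked for.

```properties
# Roll only on size: disable the time-based and count-based triggers.
httpagent.sinks.k1.hdfs.rollInterval = 0
httpagent.sinks.k1.hdfs.rollCount = 0
# 128 * 1024 * 1024 bytes, i.e. one HDFS block in this setup.
httpagent.sinks.k1.hdfs.rollSize = 134217728
# Optional (assumption): close a file after 10 minutes without new events,
# so a quiet source does not leave a file open indefinitely.
httpagent.sinks.k1.hdfs.idleTimeout = 600
```

With all three roll triggers set this way, a file should only be closed once it reaches rollSize, or when it goes idle if idleTimeout is set.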