I am trying to ingest using flume spooling directory to HDFS(SpoolDir > Memory Channel > HDFS).
I am using Cloudera Hadoop 5.4.2. (Hadoop 2.6.0, Flume 1.5.0).
It works well with smaller files, however it fails with larger files. Please find below my testing scenerio:
In real scenerio, our files will be around 2GB in size. So, need some robust flume configuration to handle workload.
Note:
Please find below flume.conf file I am using here:
#############start flume.conf####################
spoolDir.sources = src-1
spoolDir.channels = channel-1
spoolDir.sinks = sink_to_hdfs1
######## source
spoolDir.sources.src-1.type = spooldir
spoolDir.sources.src-1.channels = channel-1
spoolDir.sources.src-1.spoolDir = /stage/ETL/spool/
spoolDir.sources.src-1.fileHeader = true
spoolDir.sources.src-1.basenameHeader =true
spoolDir.sources.src-1.batchSize = 100000
######## channel
spoolDir.channels.channel-1.type = memory
spoolDir.channels.channel-1.transactionCapacity = 50000000
spoolDir.channels.channel-1.capacity = 60000000
spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
spoolDir.channels.channel-1.byteCapacity = 6442450944
######## sink
spoolDir.sinks.sink_to_hdfs1.type = hdfs
spoolDir.sinks.sink_to_hdfs1.channel = channel-1
spoolDir.sinks.sink_to_hdfs1.hdfs.fileType = DataStream
spoolDir.sinks.sink_to_hdfs1.hdfs.path = hdfs://nameservice1/user/etl/temp/spool
spoolDir.sinks.sink_to_hdfs1.hdfs.filePrefix = %{basename}-
spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0
spoolDir.sinks.sink_to_hdfs1.hdfs.idleTimeout = 60
#############end flume.conf####################
Kindly suggest me whether there is any issue with my configuration or am I missing something.
Or is it a known issue that Flume SpoolDir cannot handle with bigger files.
Regards,
-Obaid
I have tested flume with several size files and finally come up with conclusion that "flume is not for larger size files".
So, finally I have started using HDFS NFS Gateway. This is really cool and now I do not even need a spool directory in local storage. Pushing file directly to nfs mounted HDFS using scp.
Hope it will help some one who is facing same issue like me.
Thanks, Obaid