Tags: hadoop, large-files, flume

Flume Spooling Directory Source: Cannot load larger files


I am trying to ingest files into HDFS using the Flume spooling directory source (SpoolDir > Memory Channel > HDFS).

I am using Cloudera Hadoop 5.4.2 (Hadoop 2.6.0, Flume 1.5.0).

It works well with smaller files; however, it fails with larger files. Please find my testing scenario below:

  1. Files with sizes from a few kilobytes up to 50-60 MB are processed without issue.
  2. For files larger than 50-60 MB, Flume writes around 50 MB to HDFS and then the agent exits unexpectedly.
  3. There are no error messages in the Flume log. I found that Flume tries to create the ".tmp" file (on HDFS) several times, and each time it writes a couple of megabytes (sometimes 2 MB, sometimes 45 MB) before exiting unexpectedly. After some time, the last ".tmp" file it attempted is renamed as completed (".tmp" removed), and the file in the source spoolDir is also renamed with ".COMPLETED", even though the full file has not been written to HDFS.

In the real scenario, our files will be around 2 GB in size, so I need a robust Flume configuration to handle that workload.

Note:

  1. The Flume agent node is part of the Hadoop cluster but is not a datanode (it is an edge node).
  2. The spool directory is on the local filesystem of the same server running the Flume agent.
  3. All are physical servers (not virtual).
  4. In the same cluster, we have a Twitter data feed running fine with Flume (although with a very small amount of data).
  5. Please find below the flume.conf file I am using:

    #############start flume.conf####################

    spoolDir.sources = src-1
    spoolDir.channels = channel-1
    spoolDir.sinks = sink_to_hdfs1

    ######## source
    spoolDir.sources.src-1.type = spooldir
    spoolDir.sources.src-1.channels = channel-1
    spoolDir.sources.src-1.spoolDir = /stage/ETL/spool/
    spoolDir.sources.src-1.fileHeader = true
    spoolDir.sources.src-1.basenameHeader = true
    spoolDir.sources.src-1.batchSize = 100000

    ######## channel
    spoolDir.channels.channel-1.type = memory
    spoolDir.channels.channel-1.transactionCapacity = 50000000
    spoolDir.channels.channel-1.capacity = 60000000
    spoolDir.channels.channel-1.byteCapacityBufferPercentage = 20
    spoolDir.channels.channel-1.byteCapacity = 6442450944

    ######## sink
    spoolDir.sinks.sink_to_hdfs1.type = hdfs
    spoolDir.sinks.sink_to_hdfs1.channel = channel-1
    spoolDir.sinks.sink_to_hdfs1.hdfs.fileType = DataStream
    spoolDir.sinks.sink_to_hdfs1.hdfs.path = hdfs://nameservice1/user/etl/temp/spool
    spoolDir.sinks.sink_to_hdfs1.hdfs.filePrefix = %{basename}-
    spoolDir.sinks.sink_to_hdfs1.hdfs.batchSize = 100000
    spoolDir.sinks.sink_to_hdfs1.hdfs.rollInterval = 0
    spoolDir.sinks.sink_to_hdfs1.hdfs.rollSize = 0
    spoolDir.sinks.sink_to_hdfs1.hdfs.rollCount = 0
    spoolDir.sinks.sink_to_hdfs1.hdfs.idleTimeout = 60

    #############end flume.conf####################
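
For reference, a configuration like this is typically launched with a command along the following lines. This is only a sketch: the `--conf` directory and the config file path are placeholders, and on a CDH cluster the agent may instead be managed through Cloudera Manager.

    # Start the agent; --name must match the property prefix (spoolDir) used in flume.conf
    flume-ng agent \
      --conf /etc/flume-ng/conf \
      --conf-file /path/to/flume.conf \
      --name spoolDir \
      -Dflume.root.logger=INFO,console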
    

Kindly suggest whether there is any issue with my configuration, or whether I am missing something.

Or is it a known issue that the Flume spooling directory source cannot handle bigger files?

Regards,

-Obaid

  • I have posted the same topic to another open community; if I get a solution there, I will update here, and vice versa.

Solution

  • I have tested Flume with files of several sizes and have finally come to the conclusion that "Flume is not for larger files".

    So I have finally started using the HDFS NFS Gateway. This is really cool, and now I do not even need a spool directory in local storage; I push files directly to the NFS-mounted HDFS using scp (a short sketch follows at the end of this post).

    Hope it will help someone who is facing the same issue as me.

    Thanks, Obaid
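
Below is a minimal sketch of the NFS Gateway approach described in the answer. It assumes the HDFS NFS Gateway services are already running on the edge node; the hostname `edgenode`, the mount point `/hdfs_nfs`, the user `etl`, and the file name are placeholders, while the target directory mirrors the HDFS path used in the flume.conf above.

    # On the edge node: mount HDFS through the NFS Gateway (NFSv3 only)
    sudo mkdir -p /hdfs_nfs
    sudo mount -t nfs -o vers=3,proto=tcp,nolock,sync edgenode:/ /hdfs_nfs

    # From any other host: push a large file straight into HDFS over scp;
    # /hdfs_nfs/user/etl/temp/spool maps to hdfs://nameservice1/user/etl/temp/spool
    scp /data/bigfile.dat etl@edgenode:/hdfs_nfs/user/etl/temp/spool/

    # Verify on the edge node that the file landed in HDFS
    hdfs dfs -ls /user/etl/temp/spool/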