I am trying to move files from my local system into HDFS using Flume, but when I run my Flume agent it creates many small files. My original files are 154-500 KB in size, yet in HDFS Flume creates many files of only 4-5 KB. I searched and found that changing the rollSize and rollCount should help; I increased those values, but the same issue still happens. I am also getting the error below.
Error:
ERROR hdfs.BucketWriter: Hit max consecutive under-replication rotations (30); will not continue rolling files under this path due to under-replication
As I am working on a cluster, I am a bit scared to make changes in hdfs-site.xml. Please suggest what I can do to either move the original files into HDFS as they are, or make the small files larger (50-60 KB instead of 4-5 KB).
Below is my configuration.
Configuration:
agent1.sources = source1
agent1.sinks = sink1
agent1.channels = channel1
agent1.sources.source1.channels = channel1
agent1.sinks.sink1.channel = channel1
agent1.sources.source1.type = spooldir
agent1.sources.source1.spoolDir = /root/Downloads/CD/parsedCD
agent1.sources.source1.deletePolicy = immediate
agent1.sources.source1.basenameHeader = true
agent1.sinks.sink1.type = hdfs
agent1.sinks.sink1.hdfs.path = /user/cloudera/flumecd
agent1.sinks.sink1.hdfs.fileType = DataStream
agent1.sinks.sink1.hdfs.filePrefix = %{basename}
agent1.sinks.sink1.hdfs.rollInterval = 0
agent1.sinks.sink1.hdfs.batchSize = 1000
agent1.sinks.sink1.hdfs.rollSize = 1000000
agent1.sinks.sink1.hdfs.rollCount = 0
agent1.channels.channel1.type = memory
agent1.channels.channel1.maxFileSize = 900000000
I think the error you are posting is clear enough: the files you are creating are under-replicated, meaning the blocks of those files, which are distributed across the cluster, have fewer copies than the replication factor (usually 3). While that situation persists, no more rolls will be done, because each roll creates yet another under-replicated file, and the maximum number of consecutive under-replication rotations allowed (30) has been reached. This is also why you end up with so many 4-5 KB files: each time under-replication is detected, the sink rotates to a new file early, regardless of your rollSize and rollCount settings.
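You can confirm this from the command line; for example, using the output path from your configuration:

hdfs fsck /user/cloudera/flumecd -files -blocks -locations

fsck reports the number of under-replicated blocks and, with -locations, which datanodes hold each replica.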
I recommend you check why the files are under-replicated. Maybe the cluster is running out of disk space, or maybe the cluster was set up with the minimum number of nodes for the replication factor (i.e. 3 nodes) and one of them is down (i.e. only 2 datanodes are alive while the replication factor is set to 3).
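A quick way to check both possibilities at once is:

hdfs dfsadmin -report

which lists the live and dead datanodes and the remaining capacity on each one.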
Other options (not recommended) would be to decrease the replication factor, even down to 1, or to increase the allowed number of under-replication rolls (I don't know if such a thing is possible, and even if it is possible, in the end you will experience the same error again).
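If you do decide to lower the replication, you can do it for the Flume output directory only, without touching hdfs-site.xml; for example:

hdfs dfs -setrep 1 /user/cloudera/flumecd

Note that -setrep only changes files that already exist. For the files Flume writes from now on, the HDFS sink has (if I remember correctly) a hdfs.minBlockReplicas property that controls when it considers a block under-replicated, so adding something like this to your agent configuration is a commonly used workaround:

agent1.sinks.sink1.hdfs.minBlockReplicas = 1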