hadoop hdfs flume

What does Hadoop do with unreplicated data when client closes its connection?


I am running a Hadoop 2.5.0-cdh5.3.2 cluster. Flume runs elsewhere and writes data to this cluster. When the cluster is under heavy load, the flume-agent finishes writing and attempts to close the file before HDFS has finished replicating the data. The close fails and is retried, but the flume-agent is configured with a timeout, and when the close cannot complete in time, the flume-agent disconnects.

What does HDFS do with the file that has not finished replication? I was under the impression a background thread would finish the replication, but I am seeing only partially written blocks in my cluster. There is one good copy of the block and the replicas are only partially written, so HDFS considers the block corrupt.

I've read through the recovery process and did not think I'd be left with unwritten blocks.

I have the following client settings:

dfs.client.block.write.replace-datanode-on-failure.enable=true
dfs.client.block.write.replace-datanode-on-failure.policy=ALWAYS
dfs.client.block.write.replace-datanode-on-failure.best-effort=true

I set these because it seemed that the flume-agent was losing connections to datanodes and failing. I wanted it to retry, but if a block was written, to call it good and move on.

Is best-effort preventing the remaining blocks from being written? This seems pretty useless if it results in the final block being called corrupt.


Solution

  • I think the Flume agent is losing its HDFS connection before it can successfully close the file. The DFS client buffers some data locally and must flush that buffer before the file can be closed. If the HDFS connection is lost, the close fails and the block is marked corrupt. There is one scenario in which the HDFS connection is closed unexpectedly: the HDFS client registers a shutdown hook, and the order in which shutdown hooks are invoked is not guaranteed. In your case, if the Flume agent is shutting down, the HDFS client's shutdown hook may run first, and the file close will then fail. If you think this is possible, try disabling the HDFS client's automatic close on shutdown.

    fs.automatic.close = false
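
    With the automatic close disabled, the application itself becomes responsible for closing things in a sensible order at shutdown. The following is only a minimal sketch of that idea using plain JDK classes: the class OrderedShutdown and the resource names are hypothetical and are not part of Flume or Hadoop. In a real client the two resources would presumably be the open output stream and the FileSystem instance, and whether Flume lets you plug in something like this is not something the sketch assumes.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Flume or Hadoop class): one JVM shutdown hook
// that closes resources in registration order, instead of relying on the
// unspecified ordering between independent shutdown hooks.
public class OrderedShutdown {
    // LinkedHashMap preserves insertion order, which is the close order.
    private final Map<String, AutoCloseable> resources = new LinkedHashMap<>();

    public synchronized void register(String name, AutoCloseable resource) {
        resources.put(name, resource);
    }

    // Closes every registered resource in order and returns the names of
    // those that closed cleanly. A failure is logged and does not stop the
    // remaining resources from being closed.
    public synchronized List<String> closeAll() {
        List<String> closed = new ArrayList<>();
        for (Map.Entry<String, AutoCloseable> entry : resources.entrySet()) {
            try {
                entry.getValue().close();
                closed.add(entry.getKey());
            } catch (Exception ex) {
                System.err.println("close failed for " + entry.getKey() + ": " + ex);
            }
        }
        resources.clear();
        return closed;
    }

    // Registers a single shutdown hook that runs closeAll().
    public void installHook() {
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> { closeAll(); }, "ordered-shutdown"));
    }

    public static void main(String[] args) {
        OrderedShutdown shutdown = new OrderedShutdown();
        // Stand-ins for the real resources: the open file first, then the
        // filesystem client it depends on.
        shutdown.register("output-stream",
                () -> System.out.println("flush and close the open file"));
        shutdown.register("filesystem",
                () -> System.out.println("close the filesystem client"));
        shutdown.installHook();
    }
}
```

    Because both resources are closed from the same hook, the stream is always flushed and closed before the client it depends on, which is the ordering that separate shutdown hooks cannot guarantee.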