I just wanted to know more about the statement below. While trying to understand how HDFS writes happen to data nodes, I came across the following explanation.
Why does the HDFS client send 4 KB chunks to the data nodes instead of sending the entire 64 MB block? Can someone explain in detail?
For better performance, data nodes maintain a pipeline for data transfer. Data node 1 does not need to wait for a complete block to arrive before it can start transferring to data node 2 in the flow. In fact, the data transfer from the client to data node 1 for a given block happens in smaller chunks of 4 KB. When data node 1 receives the first 4 KB chunk from the client, it stores this chunk in its local repository and immediately starts transferring it to data node 2 in the flow. Likewise, when data node 2 receives the first 4 KB chunk from data node 1, it stores this chunk in its local repository and immediately starts transferring it to data node 3. This way, all the data nodes in the flow except the last one receive data from the previous one and transfer it to the next data node in the flow, improving write performance by avoiding a wait at each stage.
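The store-and-forward idea in that explanation can be sketched as a toy simulation. This is an illustration only, not HDFS code: the `DataNode` class and `receive` method are hypothetical names, and the 4 KB chunk size is taken from the explanation above.

```python
# Toy model of the HDFS write pipeline: each node stores a chunk
# locally and immediately forwards it downstream, so no node waits
# for the full block. (Hypothetical names; not the real HDFS API.)

def chunks(data, size):
    """Split a block of bytes into fixed-size chunks."""
    for i in range(0, len(data), size):
        yield data[i:i + size]

class DataNode:
    def __init__(self, name, downstream=None):
        self.name = name
        self.downstream = downstream
        self.stored = bytearray()

    def receive(self, chunk):
        self.stored.extend(chunk)       # store the chunk locally...
        if self.downstream:             # ...and immediately forward it
            self.downstream.receive(chunk)

# Pipeline: client -> dn1 -> dn2 -> dn3
dn3 = DataNode("dn3")
dn2 = DataNode("dn2", downstream=dn3)
dn1 = DataNode("dn1", downstream=dn2)

block = bytes(16 * 1024)                # toy 16 KB "block" for brevity
for c in chunks(block, 4 * 1024):       # 4 KB chunks, as in the answer
    dn1.receive(c)                      # client sends chunks to dn1 only

# All three replicas end up identical, chunk by chunk.
print(bytes(dn1.stored) == bytes(dn2.stored) == bytes(dn3.stored))
```

Note that the client only ever talks to the first data node; replication to the rest of the pipeline happens chunk by chunk as a side effect of each `receive`.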
Your question contains its own answer.
In this picture, let's assume the file size equals the block size (128 MB). So **A, B, C, … are the chunks in the block**.
https://i.sstatic.net/REO6r.jpg
When data node 1 receives the first 4 KB chunk (A) from the client, it stores this chunk in its local repository and immediately starts transferring it to data node 2 in the flow. Likewise, when data node 2 receives the first 4 KB chunk from data node 1, it stores this chunk in its local repository and immediately starts transferring it to data node 3.
The advantage here is that data nodes 2 and 3 do not need to wait until all 128 MB have been copied to data node 1 before replication starts. So the extra delay caused by replication is only about one or two chunk copy times, because all the chunks are copied to the nodes in parallel (pipelined).
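That "one or two chunks" claim can be checked with back-of-the-envelope arithmetic. Assuming (hypothetically) that every hop takes one uniform time unit per chunk, the last data node in a 3-hop pipeline finishes only two chunk-times after the first one:

```python
# Timing model for a pipelined vs. store-and-forward write.
# Assumption: each hop transfers one 4 KB chunk per time unit.

BLOCK_SIZE = 128 * 1024 * 1024        # 128 MB block, as in the picture
CHUNK_SIZE = 4 * 1024                 # 4 KB chunks
NUM_CHUNKS = BLOCK_SIZE // CHUNK_SIZE # 32768 chunks
HOPS = 3                              # client -> DN1 -> DN2 -> DN3

# Store-and-forward: every node waits for the whole block first,
# so the three hops run one after another.
store_and_forward = HOPS * NUM_CHUNKS           # 98304 chunk-times

# Pipelined: once the pipe is "full", all hops run in parallel, so the
# last node finishes only (HOPS - 1) chunk-times after the stream ends.
pipelined = NUM_CHUNKS + (HOPS - 1)             # 32770 chunk-times

print(store_and_forward, pipelined)
print(pipelined - NUM_CHUNKS)  # extra delay from replication: 2 chunk-times
```

So pipelining turns a 3x slowdown into an almost-free replication: the total write time is essentially the time to stream the block once, plus two chunk transfers.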