Tags: hadoop, hdfs, hadoop2

Are files divided into blocks for storing in HDFS?


I understand that the block system in HDFS is a logical partition on top of the underlying file system. But how is the file retrieved when I issue a cat command?

Let's say I have a 1 GB file. My default HDFS block size is 64 MB.

I issue the following command:

hadoop fs -copyFromLocal my1GBfile.db input/data/

The above command copies the file my1GBfile.db from my local machine to the input/data directory in HDFS.

This gives 16 blocks to be copied and replicated (1 GB / 64 MB = 16).
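To make the copy and the block arithmetic concrete, here is a minimal sketch against the Hadoop FileSystem Java API. It is an illustration under assumptions, not part of the question: it presumes the Hadoop 2 client jars are on the classpath, that fs.defaultFS points at the cluster, and the class name CopyAndCountBlocks is made up for the example. It performs the same copy programmatically and derives the block count from the file's length and block size.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class CopyAndCountBlocks {                     // illustrative class name
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();     // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);

            Path local = new Path("my1GBfile.db");
            Path dest  = new Path("input/data/my1GBfile.db");

            // Programmatic equivalent of `hadoop fs -copyFromLocal my1GBfile.db input/data/`
            fs.copyFromLocalFile(local, dest);

            FileStatus status = fs.getFileStatus(dest);
            long length    = status.getLen();             // ~1 GB
            long blockSize = status.getBlockSize();       // 64 MB with the question's default

            // Ceiling division: 1 GB / 64 MB gives exactly 16 blocks
            long numBlocks = (length + blockSize - 1) / blockSize;
            System.out.println(numBlocks + " blocks of up to " + blockSize + " bytes");

            fs.close();
        }
    }

The ceiling division mirrors what the namenode does: every block except possibly the last is the full block size, and the last block holds whatever is left over.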

If I have 8 datanodes, a single datanode might not have all the blocks needed to reconstitute the file.

When I issue the following command:

hadoop fs -cat input/data/my1GBfile.db | head

What happens now?

How is the file reconstituted? Although blocks are just logical partitions, how is the 1 GB file physically stored in HDFS? Does each datanode get some physical portion of the file? Also, by breaking the 1 GB input file into 64 MB chunks, we might break something at the record level (say, in the middle of a line). How is this handled?

I checked on my datanode and I do see a file named blk_1073741825 which, when opened in an editor, actually displays the contents of the file.

So are the chunks the file is split into not just logical, but real physical partitions of the data?

Kindly help clarify this.


Solution

  • Blocks are literally just files on a datanode's local filesystem, which is why opening blk_1073741825 in an editor shows your data. When you cat a file in HDFS, your client asks the namenode for the list of blocks and their datanode locations, then streams those blocks in order directly from their respective datanodes and reconstructs the entire file locally.
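If it helps to see the read path from code rather than the shell, here is a minimal sketch against the Hadoop 2 FileSystem Java API; the class name ReadHdfsFile and the 4 KB buffer are illustrative assumptions, not anything from the question or the answer. It first asks the namenode which datanodes hold each block of the file, then opens the file as a single stream, which is roughly what hadoop fs -cat does under the hood.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    import java.util.Arrays;

    public class ReadHdfsFile {                           // illustrative class name
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path file = new Path("input/data/my1GBfile.db");

            // Ask the namenode which datanodes hold each 64 MB block of the file
            FileStatus status = fs.getFileStatus(file);
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation b : blocks) {
                System.out.println("offset " + b.getOffset()
                        + " length " + b.getLength()
                        + " hosts " + Arrays.toString(b.getHosts()));
            }

            // Open the file as one continuous stream; the client fetches each block
            // from a datanode in turn, so the caller never sees block boundaries.
            byte[] buffer = new byte[4096];               // illustrative buffer size
            try (FSDataInputStream in = fs.open(file)) {
                int read = in.read(buffer);               // roughly what `hadoop fs -cat ... | head` touches
                System.out.println("first " + read + " bytes read");
            }

            fs.close();
        }
    }

Because the client sees one continuous byte stream, a line that happens to straddle two blocks comes back intact when you cat the file; it is only split-aware consumers such as MapReduce's TextInputFormat that have to deal with records crossing a block boundary, which they do by reading past the end of their split to finish the last record.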