Tags: java, hadoop, hdfs, distributed-computing

Read HDFS file splits


With HDFS's Java API, it's straightforward to read a file sequentially, one block at a time. Here's a simple example.
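
A minimal sketch of that sequential read, assuming fs is an already-open FileSystem handle (the path placeholder is illustrative):

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Path file = new Path("<path-to-file>");
    long blockSize = fs.getFileStatus(file).getBlockSize();
    byte[] buffer = new byte[(int) blockSize];

    // Read front to back, up to one block-sized chunk at a time.
    try (FSDataInputStream in = fs.open(file)) {
      int read;
      while ((read = in.read(buffer)) > 0) {
        // process `read` bytes of buffer...
      }
    }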

I want to be able to read the file one block at a time, using something like HDFS's FileSplits. The end goal is to read the file in parallel from multiple machines, each machine reading its own range of blocks. Given an HDFS Path, how can I get the FileSplits or blocks?

MapReduce and other processing frameworks are not involved; this is strictly a file-system-level operation.


Solution

  • This is how you get the block locations of a file in HDFS:

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.BlockLocation;
      import org.apache.hadoop.fs.FileStatus;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      FileSystem fs = FileSystem.get(new Configuration());
      Path dataset = new Path(fs.getHomeDirectory(), "<path-to-file>");
      FileStatus datasetFile = fs.getFileStatus(dataset);

      // Ask the NameNode for all blocks, from offset 0 to the file's full length.
      BlockLocation[] myBlocks = fs.getFileBlockLocations(datasetFile, 0, datasetFile.getLen());
      for (BlockLocation b : myBlocks) {
        System.out.println("Length " + b.getLength());
        // Each block is replicated; getHosts() names the DataNodes holding a replica.
        for (String host : b.getHosts()) {
          System.out.println("host " + host);
        }
      }
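
    Each BlockLocation also reports the block's byte offset and length, so once a machine knows which blocks it owns it can seek straight to them. Here's a minimal sketch under the same setup; isAssignedToThisMachine is a hypothetical placeholder for whatever partitioning logic you use:

      import org.apache.hadoop.fs.FSDataInputStream;

      try (FSDataInputStream in = fs.open(dataset)) {
        for (BlockLocation b : myBlocks) {
          if (!isAssignedToThisMachine(b)) continue; // hypothetical partitioning hook
          byte[] chunk = new byte[(int) b.getLength()];
          in.seek(b.getOffset());  // jump to this block's first byte
          in.readFully(chunk);     // read exactly one block's worth of data
          // process chunk...
        }
      }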