
HDFS FileSplit locations


I have a cluster running hadoop-2.1.0-beta. Is there a way to find out where each file split is located in the cluster? I am looking for a list such as the following:

filesplit_0001 node1
filesplit_0002 node4
...

edit: I know that such a list is available in Microsoft Azure.


Solution

  • The fsck tool provides an easy way to find out which blocks make up a particular file and, with -locations, which datanodes hold each block. For example:

    % hadoop fsck <path> -files -blocks -locations -racks
    

    Reference: Hadoop Command Line Guide.

    Edit:

    An input split is a chunk of the input that is processed by a single map task; each map task processes exactly one split. Each split is divided into records, and the map processes each record (a key-value pair) in turn. Splits and records are logical, whereas HDFS blocks are physical.

    An InputSplit has a length in bytes and a set of storage locations, which are just hostname strings. A split doesn’t contain the input data; it is just a reference to the data.

    You can get the InputSplit instance inside the map method:

    InputSplit inputSplit = context.getInputSplit(); // the split this map task is processing
    String[] splitLocations = inputSplit.getLocations(); // hostnames that store the split's data
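
To make the logical/physical distinction above concrete, here is a minimal, self-contained sketch in plain Java. It is a simplified model, not Hadoop code: the `SplitSketch`, `Block`, and `Split` names, the node names, and the file sizes are all hypothetical. It assumes each split simply takes its location hints from the block that contains the split's first byte; Hadoop's real input formats apply additional rules that are not modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified model (not Hadoop code) of how logical splits refer to physical blocks. */
class SplitSketch {

    /** A physical block: a byte range of the file plus the hosts that hold a replica. */
    static final class Block {
        final long offset;
        final long length;
        final String[] hosts;

        Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** A logical split: a byte range plus location hints. It holds no data itself. */
    static final class Split {
        final long start;
        final long length;
        final String[] locations;

        Split(long start, long length, String[] locations) {
            this.start = start;
            this.length = length;
            this.locations = locations;
        }

        @Override
        public String toString() {
            return "split[start=" + start + ", length=" + length + "] "
                    + String.join(",", locations);
        }
    }

    /** Cuts the file into splits of at most splitSize bytes (assumed rule, see lead-in). */
    static List<Split> computeSplits(List<Block> blocks, long fileLength, long splitSize) {
        if (splitSize <= 0) {
            throw new IllegalArgumentException("splitSize must be positive");
        }
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            long length = Math.min(splitSize, fileLength - start);
            // Location hints come from the block containing the split's first byte.
            splits.add(new Split(start, length, hostsFor(blocks, start)));
        }
        return splits;
    }

    /** Returns the hosts of the block that contains the given byte offset. */
    static String[] hostsFor(List<Block> blocks, long offset) {
        for (Block b : blocks) {
            if (offset >= b.offset && offset < b.offset + b.length) {
                return b.hosts;
            }
        }
        throw new IllegalArgumentException("offset " + offset + " is outside the file");
    }

    public static void main(String[] args) {
        // Hypothetical 300-byte file stored as three blocks on made-up nodes.
        List<Block> blocks = Arrays.asList(
                new Block(0, 128, "node1", "node2"),
                new Block(128, 128, "node4", "node1"),
                new Block(256, 44, "node3", "node2"));
        for (Split s : computeSplits(blocks, 300, 128)) {
            System.out.println(s);
        }
    }
}
```

When the split size equals the block size, each split lines up with exactly one block. With a different split size (for example 100 bytes in this model), a split can start in one block and end in the next, which is why a split's locations are only hints about where most of its data lives.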