Search code examples
hadoophadoop-partitioning

Hadoop FileInputFormat isSplitable false


I have a quick question that I think I know the answer to about the FileInputFormat isSplitable method. If I override this method to return false, naturally I'll have a single mapper process a file (I only have 1 file). If this file is distributed across HDFS, all of it will get pulled to my single mapper. When I process it with the mapper and create the key/values pairs to send to reducers, if I create a large number of them, will they get distributed then across my cluster as to take advantage of data locality or is there some kind of implicit consequence that if I made it isSplitable false that that doesn't happen anymore?


Solution

  • When isSplitable returns false only a single mapper processes the entire file. The mapper can emit any number of KV pairs.

    Coming to the reducer, there is no concept of data locality, the next available free Reduce slot is used.. FYI, in case of legacy MR architecture, there are slots for Map and Reduce on each node, but in case of YARN there is no concept of slots.

    The reducers can be spread across on multiple nodes based on the availability of the slots or based on what the ResourceManager returns in case of YARN.