Tags: hadoop, mapreduce, hdfs

In Hadoop, how can you give a whole file as input to a mapper?


An interviewer recently asked me this question.

I said: by configuring the block size or the split size to be equal to the file size.

He said it was wrong.


Solution

  • Well, if you phrased it like that, I think he didn't like the "configuring block size" part.

    EDIT: On reflection, I think changing the block size is a bad idea anyway, because it is global to HDFS and would affect every file, not just the one you want to process.

    On the other hand, a solution that prevents splitting would be to set the minimum split size larger than the largest file you want to map.
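    For example, you can pass the minimum split size on the command line when launching the job (a sketch; `myjob.jar`, `MyDriver`, and the paths are placeholders, and the property name below is the Hadoop 2.x one — older releases used `mapred.min.split.size`):

    ```shell
    # Force the minimum split size above any input file (here 1 TiB),
    # so each input file becomes a single split, i.e. a single mapper:
    hadoop jar myjob.jar MyDriver \
        -D mapreduce.input.fileinputformat.split.minsize=1099511627776 \
        /input /output
    ```

    This only works as long as no input file ever exceeds the chosen value, which is why the subclassing approach below is cleaner.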

    A cleaner solution would be to subclass the concerned InputFormat implementation, in particular by overriding the isSplitable() method to return false. In your case you could do something like this with FileInputFormat (here extending TextInputFormat, a concrete FileInputFormat subclass, so you don't have to implement createRecordReader yourself):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    
    public class NoSplitFileInputFormat extends TextInputFormat 
    {
        // Returning false tells the framework never to split files read
        // through this format: one file, one split, one mapper.
        @Override
        protected boolean isSplitable(JobContext context, Path file) 
        {
            return false;
        }
    }
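    To use it, register the format in your job driver (a sketch of the standard driver boilerplate; `WholeFileDriver` is a placeholder name, and mapper/reducer setup is omitted):

    ```java
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WholeFileDriver
    {
        public static void main(String[] args) throws Exception
        {
            Job job = Job.getInstance(new Configuration(), "whole-file job");
            job.setJarByClass(WholeFileDriver.class);

            // Each input file now maps to exactly one split, i.e. one mapper.
            job.setInputFormatClass(NoSplitFileInputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    ```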