Tags: hadoop, elastic-map-reduce

How does Hadoop handle large files?


I am totally new to Hadoop, though I understand the concept of MapReduce fairly well.

Most of the Hadoop tutorials start with the WordCount example, so I wrote a simple word-count program, which worked perfectly well. But now I am trying to run a word count over a very large document (over 50 GB).
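For reference, the program is essentially the standard Hadoop tutorial WordCount, roughly along these lines (the class name and input/output paths below are just placeholders, not my exact code):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: emits (word, 1) for every token in its input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // optional local aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));     // input dir on HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1]));   // output dir on HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```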

So my question to the Hadoop experts is: how will Hadoop handle such a large file? Will it transfer a copy of the whole file to each mapper, or will it automatically split it into blocks and send those blocks to the mappers?

Most of my experience with MapReduce comes from CouchDB, where the mapper handles one document at a time. From what I have read about Hadoop, I wonder whether it is designed to handle many small files, a few large files, or both?


Solution

  • Hadoop handles large files by splitting them into blocks of 64 MB or 128 MB (the default, depending on the version). The blocks are stored across the DataNodes, and the metadata is kept on the NameNode. When a MapReduce job runs, each block gets its own mapper; you cannot directly set the number of mappers, since it is determined by the number of input splits. When the mappers finish, their output is sent to the reducers. The default number of reducers is one, but it can be set explicitly, and that is where you get the final output. Hadoop can also handle many small files, but it is preferable to group them into larger files (or larger splits) for better performance: if each small file is smaller than the block size (e.g. 64 MB), each file still gets its own mapper. One way to do that grouping is shown in the sketch below. Hope this helps!
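A minimal sketch of the small-files case using CombineTextInputFormat, which packs several small files into one split so fewer mappers run. The class name, split size, and reducer count here are illustrative, and the mapper/reducer are assumed to be the WordCount classes from the question:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SmallFilesWordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count over small files");
    job.setJarByClass(SmallFilesWordCount.class);

    // Reuse the mapper/reducer from the WordCount sketch above.
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Pack small files together into splits of up to ~128 MB each,
    // instead of launching one mapper per tiny file.
    job.setInputFormatClass(CombineTextInputFormat.class);
    CombineTextInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);

    // Unlike the number of mappers, the number of reducers can be set
    // explicitly (the default is 1).
    job.setNumReduceTasks(4);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With the plain TextInputFormat, every file smaller than a block still gets its own mapper; combining the inputs keeps each mapper busy with a reasonable amount of data.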