hadoop · hdfs · distributed-cache

Hadoop Distributed File System vs Distributed Cache


What is the difference between the Distributed File System and the Distributed Cache in Hadoop?


Solution

  • A Distributed File System, such as the Hadoop Distributed File System (HDFS), is an architecture for storing one or more very large files across the disks of many machines. Each machine holds a portion of the file, called a block, and each block is usually replicated several times (three by default), so if some machines crash, the lost blocks can be recovered from their replicas on other machines. Your PC has a file system too, but it is most probably not distributed; it is simply where your files are stored and organised in hierarchies. (The first sketch after this answer shows how to inspect a file's blocks and replication in HDFS.)

    A Distributed Cache, by contrast, is a means of giving the same input file(s) to all the machines while a job is running. The file(s) are made available locally on every machine, where the tasks typically load them into memory. Say, for example, that you have a list of stopwords that you don't want your wordcount program to count. At the beginning of the MapReduce job you distribute this stopwords file to all the map tasks; each map task reads it and skips counting those stopwords. This way, all the tasks share a common input file. Once the job is finished, the Distributed Cache for it is gone. (The second sketch below shows how this looks in code.)

    My answer may not be technically precise in every detail, but I hope it gives the proper intuition.
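
To make the HDFS side concrete, here is a minimal sketch (not part of the original answer) using Hadoop's Java `FileSystem` API to print a file's replication factor and the machines that hold each of its blocks. The path `/data/big-input.txt` is a hypothetical example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlockInfo {
    public static void main(String[] args) throws Exception {
        // Picks up cluster settings from core-site.xml / hdfs-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Hypothetical HDFS path, used only for illustration.
        Path file = new Path("/data/big-input.txt");
        FileStatus status = fs.getFileStatus(file);

        System.out.println("Replication factor: " + status.getReplication());
        System.out.println("Block size (bytes): " + status.getBlockSize());

        // Each BlockLocation lists the machines holding a replica of that block.
        for (BlockLocation block : fs.getFileBlockLocations(status, 0, status.getLen())) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + String.join(", ", block.getHosts()));
        }
        fs.close();
    }
}
```

The same information is also visible from the command line with `hdfs fsck /data/big-input.txt -files -blocks -locations`.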
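And here is a sketch of the Distributed Cache side, assuming the newer `org.apache.hadoop.mapreduce` API: the driver adds a hypothetical stopwords file from HDFS to the cache, and each map task reads its local copy in `setup()` so those words are skipped during counting. The path `/user/hadoop/stopwords.txt` and the class names are illustrative, not from the original answer.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.URI;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;

public class StopwordAwareWordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);
        private final Set<String> stopwords = new HashSet<>();

        @Override
        protected void setup(Context context) throws IOException {
            // The driver added the cache file with a "#stopwords" fragment, so the
            // framework exposes it as a local symlink named "stopwords" in the
            // task's working directory.
            try (BufferedReader reader = new BufferedReader(new FileReader("stopwords"))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    stopwords.add(line.trim().toLowerCase());
                }
            }
        }

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().toLowerCase().split("\\s+")) {
                if (!token.isEmpty() && !stopwords.contains(token)) {
                    context.write(new Text(token), ONE);
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "stopword-aware wordcount");
        job.setJarByClass(StopwordAwareWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        // Hypothetical HDFS path; the "#stopwords" fragment names the local symlink.
        job.addCacheFile(new URI("/user/hadoop/stopwords.txt#stopwords"));
        // ... set the reducer, output types and input/output paths as in the
        // standard wordcount example, then call job.waitForCompletion(true).
    }
}
```

The key point the answer makes is visible here: the stopwords file is shipped to every node once per job and read by each map task locally, and the cached copies disappear once the job completes.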