Tags: java, hadoop, mapreduce, cluster-computing, distributed-computing

Is it necessary to execute a task on Hadoop DataNode?


Is this the way Hadoop works?

  1. The client submits a MapReduce job/program to the NameNode.

  2. The JobTracker (which resides on the NameNode) allocates tasks to the slave TaskTrackers running on the individual worker machines (DataNodes).

  3. Each TaskTracker is responsible for executing and managing the individual tasks assigned by the JobTracker.
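For reference, here is roughly what that standard submission path looks like as a minimal driver sketch (the mapper, reducer, and input/output paths are placeholders, not my actual program): the client only builds and submits the job; the tasks themselves run on the worker nodes.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SubmitJob {

    // Trivial placeholder mapper: emits each input line with a count of 1.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(value, ONE);
        }
    }

    // Trivial placeholder reducer: sums the counts per line.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up the cluster's *-site.xml files
        Job job = Job.getInstance(conf, "line count");
        job.setJarByClass(SubmitJob.class);            // this jar is shipped to the worker nodes
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // placeholder HDFS input path
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // placeholder HDFS output path
        // Submission happens here; map and reduce tasks execute on the slaves, not in this JVM.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}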

According to the above scenario, the MapReduce program runs on the slave nodes. Does that mean the job will consume the slaves' computation engine / processing power?

What if I want to use another machine (independent of the Hadoop installation) to execute the MapReduce job while still using the Hadoop cluster's data?

Why would I still use a Hadoop cluster? Because Hadoop distributes large data across its DataNode(s) in a very efficient way.

The new scenario would be as follows:

a. Server

b. Client

a.1) Distribute the unordered data using the Hadoop cluster.

b.1) The client will execute (not submit to the NameNode) a MapReduce job that gets its data from the Hadoop cluster's DataNodes. If that is possible, what happens to the JobTracker (NameNode) and the TaskTracker (DataNode)? (A small sketch of this scenario follows below.)
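To make scenario b.1 concrete, here is a sketch of what I have in mind (the NameNode address and file path are placeholders): a plain Java program on the client machine opens a file that lives in HDFS and processes it locally, without submitting anything to the JobTracker/ResourceManager.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class LocalProcessing {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Point the client at the cluster's NameNode (placeholder host/port).
        conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");

        try (FileSystem fs = FileSystem.get(conf);
             BufferedReader reader = new BufferedReader(new InputStreamReader(
                     fs.open(new Path("/data/unordered.txt")),   // placeholder HDFS path
                     StandardCharsets.UTF_8))) {
            long lines = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                lines++;   // all "map/reduce"-style work happens in this client JVM
            }
            System.out.println("Processed " + lines + " lines on the client machine");
        }
    }
}

The trade-off is that the data travels over the network to the client instead of the computation moving to the DataNodes that hold the blocks, which is exactly the data locality the cluster normally provides.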

I realize I am ignoring a major part of Hadoop here by executing the job on the client machine, but that is my project requirement. Any suggestions?


Solution

  • You are right about the first part. Note that the JobTracker/TaskTracker architecture is Hadoop 1; you should look at Hadoop 2, which is the more recent architecture.

    You are confusing HDFS with MapReduce.

    • HDFS: This is Hadoop's distributed file system. The NameNode is the master of the cluster; it holds the metadata and the location of the files. The DataNodes are the slaves of the cluster; they store the data across the cluster.

    • MapReduce: The "new" architecture is called YARN and works like this: you have a master role, the ResourceManager, and some slaves, the NodeManagers. When you submit a MapReduce jar to the cluster, the ResourceManager allocates the processing to the NodeManagers. To simplify, each NodeManager executes the program on a part of a file stored in HDFS.

    So just correctly separate the HDFS role from the MapReduce role; the small configuration sketch below illustrates the split.
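    As a rough illustration of that split (hostnames are placeholders; the property names are the standard Hadoop 2 ones), the same driver code can read its input from HDFS while running the computation either on the cluster via YARN or inside the submitting client JVM:

    import org.apache.hadoop.conf.Configuration;

    public class WhereDoesItRun {

        // Data in HDFS, computation on the cluster (ResourceManager + NodeManagers).
        public static Configuration clusterExecution() {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");   // storage layer (HDFS)
            conf.set("mapreduce.framework.name", "yarn");                   // execution layer (YARN)
            conf.set("yarn.resourcemanager.hostname", "rm.example.com");    // placeholder RM host
            return conf;
        }

        // Data still in HDFS, computation in the client JVM (LocalJobRunner).
        public static Configuration clientSideExecution() {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");   // same storage layer
            conf.set("mapreduce.framework.name", "local");                  // run map/reduce locally
            return conf;
        }
    }

    With "local" execution the job runs entirely in the machine that submits it, which is closest to your requirement, but you give up the parallelism and data locality that the NodeManagers (or TaskTrackers in Hadoop 1) would otherwise provide.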