Tags: apache-spark, mapreduce, hadoop-yarn, hadoop2, mrv2

YARN and MapReduce Framework


I am aware of the basics of the YARN framework, but I still feel I lack some understanding with regard to MapReduce.

I have read that MapReduce is just one of the applications that can run on top of YARN; for example, on the same YARN cluster, various kinds of jobs can run: MapReduce jobs, Spark jobs, etc.

Now, the point is, each type of job has its own kind of "job phases"; for example, a MapReduce job goes through phases like map, sort, shuffle, and reduce.
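
For concreteness, here is the canonical WordCount sketch (Hadoop's standard mapreduce API). I only write the map and reduce phases; the sorting and shuffling between them happen somewhere outside my code, and that "somewhere" is exactly what I am trying to pin down:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

      // Map phase: user code, invoked once per input record.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce phase: user code, invoked once per key after the sort/shuffle
      // phases have grouped all values for that key together.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }
    }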

Specific to this scenario, who "decides" and "controls" these phases? Is it the MapReduce framework?

As I understand it, YARN is an infrastructure on which different jobs run; so when we submit a MapReduce job, does it first go to the MapReduce framework, and then the code is executed by YARN? I have this doubt because YARN is a general-purpose execution engine, so it won't have knowledge of mappers, reducers, etc., which are specific to MapReduce (and likewise for other kinds of jobs). So does the MapReduce framework run on top of YARN, with YARN helping to execute the jobs, while the MapReduce framework itself is aware of the phases it has to go through for a particular kind of job?

Any clarification would be of great help.


Solution

  • If you take a look at this picture from the Hadoop documentation:

    Yarn Architecture

    You'll see that there is no particular "job orchestration" component, but rather a resource-requesting component called the ApplicationMaster. As you mentioned, YARN does resource management, and with regard to application orchestration it stops at an abstract level.

    The per-application ApplicationMaster is, in effect, a framework specific library and is tasked with negotiating resources from the ResourceManager and working with the NodeManager(s) to execute and monitor the tasks.
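
    To make that concrete for MapReduce, here is a minimal driver sketch (standard Hadoop Job API; the TokenizerMapper/IntSumReducer classes are the WordCount ones sketched in the question). With mapreduce.framework.name set to yarn, the job client submits to the ResourceManager, which launches MapReduce's own application master (MRAppMaster) in a container; that framework-specific process, not YARN itself, then drives the map/sort/shuffle/reduce phases:

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

        public class WordCountDriver {
          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Run on YARN instead of the local runner (usually set
            // cluster-wide in mapred-site.xml; shown inline for illustration).
            conf.set("mapreduce.framework.name", "yarn");

            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // waitForCompletion() asks the ResourceManager for a container in
            // which MapReduce's ApplicationMaster (MRAppMaster) is started;
            // from then on MRAppMaster, not YARN, schedules the map and
            // reduce tasks.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
          }
        }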

    When applied to Spark, some of the components in that picture would be:

    • Client: the spark-submit process
    • App Master: Spark's ApplicationMaster, which runs both the driver and the application master logic (cluster mode) or only the application master logic (client mode)
    • Containers: Spark executors

    Spark's YARN integration provides the application master (in YARN terms), which knows about Spark's architecture. So whether the driver runs in cluster mode or in client mode, it is still the driver that decides on jobs/stages/tasks. This has to be application/framework-specific (Spark being the "framework" as far as YARN is concerned).
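
    A small illustration of that point, as a runnable sketch (plain Java RDD API; setMaster("local[*]") is only there so it runs standalone, since under YARN spark-submit supplies the master): the driver records transformations in a DAG and only splits it into stages and tasks when an action runs, while YARN merely supplies the containers those tasks execute in.

        import java.util.Arrays;

        import org.apache.spark.SparkConf;
        import org.apache.spark.api.java.JavaSparkContext;

        public class DriverDecides {
          public static void main(String[] args) {
            // "local[*]" just makes this sketch runnable on a laptop; on a
            // cluster, spark-submit --master yarn supplies the master.
            SparkConf conf = new SparkConf()
                .setAppName("driver-decides")
                .setMaster("local[*]");
            try (JavaSparkContext sc = new JavaSparkContext(conf)) {
              long evens = sc.parallelize(Arrays.asList(1, 2, 3, 4, 5, 6))
                  .filter(x -> x % 2 == 0) // transformation: recorded in the DAG
                  .count();                // action: the driver splits the DAG
                                           // into stages/tasks and schedules them
              System.out.println("even numbers: " + evens);
            }
          }
        }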

    From the Spark documentation on YARN deployment:

    In cluster mode, the Spark driver runs inside an application master process which is managed by YARN on the cluster, and the client can go away after initiating the application. In client mode, the driver runs in the client process, and the application master is only used for requesting resources from YARN.
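
    To see the client side in code, here is a sketch using Spark's SparkLauncher API (the jar path and main class are placeholders): flipping setDeployMode between "cluster" and "client" is exactly the difference the quote describes, namely whether the driver runs inside the YARN-managed application master or inside this client process.

        import org.apache.spark.launcher.SparkLauncher;

        public class SubmitToYarn {
          public static void main(String[] args) throws Exception {
            Process spark = new SparkLauncher()
                .setAppResource("/path/to/my-app.jar") // placeholder jar
                .setMainClass("com.example.MyApp")     // placeholder main class
                .setMaster("yarn")
                // "cluster": driver runs inside the YARN application master.
                // "client": driver runs here; the AM only requests resources.
                .setDeployMode("cluster")
                .launch();
            System.exit(spark.waitFor());
          }
        }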

    You can extend this abstraction to MapReduce, given your understanding of that framework.