apache-spark, mapreduce, apache-spark-sql, rdd

Does Spark internally use Map-Reduce?


Is Spark using MapReduce internally? (its own map-reduce)

The first time I heard somebody tell me "Spark uses map-reduce", I was very confused; I had always learned that Spark was an alternative to Hadoop MapReduce.

After searching on Google, I found a website with a very short explanation of this: https://dzone.com/articles/how-does-spark-use-mapreduce

But the rest of the Internet only compares Spark and MapReduce.

Then somebody explained to me that when Spark creates an RDD, the data is split into different partitions, and that even if you run, for example, a Spark SQL query that does not look like a map-reduce job, such as:

select student 
from Table_students 
where name = "Enrique"

internally Spark is doing a map-reduce to retrieve the data (from the different partitions).

Is this true?
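For reference, here is a minimal, self-contained sketch (the student data and column names are made up for illustration) that runs this query and prints the physical plan with explain(); for a pure filter like this, the plan contains only map-side (narrow) operations, with no shuffle/reduce stage:

import org.apache.spark.sql.SparkSession

object NarrowFilterExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("NarrowFilterExample")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Toy stand-in for Table_students (hypothetical data).
    val students = Seq(("Enrique", 1), ("Maria", 2)).toDF("name", "student")
    students.createOrReplaceTempView("Table_students")

    val result = spark.sql(
      "select student from Table_students where name = 'Enrique'")

    // explain() prints the physical plan: for this query it is a
    // scan + filter + project, i.e. map-side work with no shuffle.
    result.explain()

    spark.stop()
  }
}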

If I'm using Spark MLlib for machine learning, I have always heard that machine learning is not a good fit for MapReduce because it needs many iterations, while MapReduce uses batch processing.

In Spark MLlib, is Spark internally using map-reduce too?
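To make the iteration point concrete, here is a minimal sketch (the toy data and learning rate are assumptions for illustration, not MLlib code) of the kind of loop iterative algorithms run in Spark: the input is cached in memory once, and every pass launches a new job over the cached data instead of a new disk-based MapReduce job:

import org.apache.spark.sql.SparkSession

object IterativeSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("IterativeSketch")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy one-dimensional data; cache() keeps it in memory
    // across all iterations.
    val points = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0)).cache()

    // Gradient-descent-style loop: each iteration is a new Spark
    // job over the same cached RDD, moving the estimate toward
    // the mean of the data.
    var estimate = 0.0
    for (_ <- 1 to 10) {
      val gradient = points.map(x => estimate - x).mean()
      estimate -= 0.5 * gradient
    }
    println(s"estimate after 10 iterations: $estimate")

    spark.stop()
  }
}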


Solution

  • Spark features an advanced Directed Acyclic Graph (DAG) engine supporting cyclic data flow. Each Spark job creates a DAG of task stages to be performed on the cluster. Compared to MapReduce, which creates a DAG with two predefined stages, Map and Reduce, the DAGs created by Spark can contain any number of stages. The DAG is a strict generalization of the MapReduce model. This allows some jobs to complete faster than they would in MapReduce, with simple jobs completing after just one stage and more complex jobs completing in a single run of many stages, rather than having to be split into multiple jobs.

    So, you can write map-reduce-style programs on Spark, but internally Spark executes them as a DAG, as sketched below.

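As a concrete illustration of a DAG with more than two stages, here is a minimal sketch (the word data and job shape are made up): the two shuffles introduced by reduceByKey and sortByKey split the job into three stages, which toDebugString makes visible:

import org.apache.spark.sql.SparkSession

object MultiStageDag {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MultiStageDag")
      .master("local[*]")
      .getOrCreate()
    val sc = spark.sparkContext

    val words = sc.parallelize(
      Seq("spark", "dag", "spark", "stage", "dag", "spark"))

    val counts = words
      .map(w => (w, 1))                 // narrow, map-side work
      .reduceByKey(_ + _)               // shuffle -> second stage
      .map { case (w, n) => (n, w) }    // narrow again
      .sortByKey(ascending = false)     // second shuffle -> third stage

    // toDebugString prints the lineage with its shuffle (stage)
    // boundaries; collect() runs the whole multi-stage DAG.
    println(counts.toDebugString)
    counts.collect().foreach(println)

    spark.stop()
  }
}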