mahout, apache-spark

Mahout on Spark


I had planned to use some of the clustering algorithms offered by Mahout, running on Hadoop.

Now I see that things have changed and that Mahout is moving from MapReduce to Spark.

That confuses me: how do I implement a system like that? Do I even need Hadoop, and if so, for what? And how do I combine Mahout and Spark?

Thanks


Solution

Some useful facts:

  • Hadoop is two things: 1) a distributed, resilient file system (HDFS), and 2) a MapReduce distributed execution platform.
  • Spark uses Hadoop's file system (HDFS).
  • Mahout still has many algorithms implemented in Hadoop MapReduce.
  • Here is a page explaining which algorithms are based on which platform: http://mahout.apache.org/users/basics/algorithms.html

This boils down to the fact that you can install only the things you need, or install them all and not worry about what an individual algorithm needs.

There are several ways to install Spark + Hadoop on the same cluster or on a single machine. The simplest is uncoordinated (and very easy to set up); the most efficient is to use a coordinating resource manager like Mesos or Hadoop's YARN, which is recommended for large, heavily used, or production clusters.
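
For the coordinated case, a Spark application can be pointed at the cluster manager when its context is created. Here is a minimal sketch, assuming a YARN-managed Hadoop cluster whose configuration is visible via HADOOP_CONF_DIR; the application name is illustrative, and older Spark releases use the master string "yarn-client" where newer ones use just "yarn":

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Run Spark on top of Hadoop's YARN resource manager (the coordinated setup).
// Assumes HADOOP_CONF_DIR / YARN_CONF_DIR point at the cluster's Hadoop configuration.
val conf = new SparkConf()
  .setAppName("mahout-on-spark")   // illustrative application name
  .setMaster("yarn-client")        // "yarn" on newer Spark versions; "local[*]" for a single machine

val sc = new SparkContext(conf)
// ... distributed work goes here ...
sc.stop()
```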

When to install Hadoop

Basically, Hadoop is always needed. If you are using Mahout's clustering, it only requires Hadoop (HDFS and MapReduce), so Spark is not required. If you want to run on Spark instead, its MLlib library has some clustering algorithms of its own.
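
If you do go the Spark/MLlib route for clustering, a minimal sketch might look like the following; the HDFS input path, k = 5, and the iteration count are placeholder values, and this uses MLlib's RDD-based KMeans API:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(new SparkConf().setAppName("mllib-kmeans-sketch"))

// Load points from HDFS: one whitespace-separated numeric vector per line (hypothetical path).
val points = sc.textFile("hdfs:///data/points.txt")
  .map(line => Vectors.dense(line.trim.split("\\s+").map(_.toDouble)))
  .cache()

// Cluster into 5 groups, running at most 20 iterations.
val model = KMeans.train(points, 5, 20)

// Print the learned cluster centers.
model.clusterCenters.foreach(println)
sc.stop()
```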

When to install Spark

As of today there is an extensive matrix/vector/linear algebra DSL in Scala, including some collaborative filtering algorithms, running on Spark. So Spark is only needed for those, but more is being implemented as we write.
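
To give a flavour of that DSL, here is a minimal sketch using Mahout's Spark bindings. The imports and the mahoutSparkContext helper follow the Mahout 0.10-era Scala bindings, so exact package names may vary by version, and the local master URL is just an example:

```scala
import org.apache.mahout.math.scalabindings._
import org.apache.mahout.math.scalabindings.RLikeOps._
import org.apache.mahout.math.drm._
import org.apache.mahout.math.drm.RLikeDrmOps._
import org.apache.mahout.sparkbindings._

// A Mahout distributed context backed by Spark (local master here, a cluster URL in practice).
implicit val ctx = mahoutSparkContext(masterUrl = "local[*]", appName = "samsara-sketch")

// A small in-core matrix, turned into a DRM (distributed row matrix).
val inCoreA = dense((1.0, 2.0), (3.0, 4.0), (5.0, 6.0))
val drmA = drmParallelize(inCoreA)

// Distributed linear algebra with R-like operators: A' %*% A, computed on Spark.
val drmAtA = drmA.t %*% drmA
println(drmAtA.collect)  // small result, safe to bring back in-core
```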