Search code examples
architecturemapreducesharding

Map/Reduce on a single server


Does it make sense to do map/reduce on a non sharded architecture ?

Or, in other words, is it effective to do it on a single server.


Solution

  • Overall I disagree with Praveen.

    Yes, I agree that when running on a single system you lose the fault-tolerance properties of the platform. However there are many situations where the platform has useful properties for specific purposes.

    There are many situations where using the Hadoop toolkit has advantages over doing it without Hadoop.

    1. You do not need to worry about the size of the input file. If your input data is many GiB then you can still run it on a system where you only have 512MiB of system RAM available.
    2. With the platform you can make your data processing application run multithreaded without the need to dive into creating threads. You simply deploy your application on a different instance of the platform.
    3. You keep the door open to scaling out over multiple systems. When your application reaches that level then the step towards real horizontal scalability is a very simple one.

    When you have written your processing application using Hadoop you have several options for running it:

    1. Single threaded on a single box using the local filesystem. This way it is simply a commandline Java application that transforms input into output.
    2. With just a jobtracker/tasktracker setup on a single box using the local file system. See this stackoverflow question for more info: Is it possible to run Hadoop in Pseudo-Distributed operation without HDFS?
    3. Full blown on a single system (the pseudo-distributed mode).
    4. Full blown multi system setup.