Search code examples
hadoopdistributed-computinggrid-computing

Hadoop not suitable for distributed processing across many sites?


I've read a few articles suggesting that Hadoop is only really designed to work on a cluster at a single physical location, not for a number of widely distributed nodes (e.g. running a distributed cluster over the internet from multiple sites).

Does anyone have any real experience trying to use Hadoop across mutliple sites? What kind of issues will I run into? Or am I better to just go with a different framework (e.g. BOINC).


Solution

  • If there's any difference between executing on a set of relatively local nodes vs on a set of widely distributed nodes it would be in the increased time required to move large amounts of data back and forth between nodes. If you have a problem that involves crunching, aggregating and joining large amounts of data then you will necessarily be sending large amounts of data between your nodes. That means that no matter what platform you choose (hadoop, storm, etc) you will have to deal with this issue. BOINC or some other volunteer-based system may be cheaper, but your implementation will still be hit with high data transfer costs. Furthermore, you'll likely introduce node heterogeneity into the mix which will make your implementation even more interesting to develop and debug.

    And by the way, hadoop and BOINC are two very different animals solving very different problems.