
Hadoop virtual cluster vs single machine


I have a question about the speed and performance of running multiple virtualized nodes on a single machine versus running a single node directly on that same machine.

Which one will perform better?

I ask because I am currently learning Hadoop on a single machine, and I have seen tutorials online that show multiple virtualized nodes running on a single machine.

Thank you in advance


Solution

  • Virtualization always comes with some overhead, so unless it is really necessary I wouldn't advise running Hadoop in a virtualized environment.

    That being said, I know VMware did a lot of work on making Hadoop run well in a virtualized environment, and they have published benchmarks in which they claim that, under certain conditions, VMs perform better than a native installation. I haven't played much with vSphere, but it could be worth a look if you want to explore virtualization further. Don't take the numbers for granted, though: it really depends on the hardware you're running. In some conditions you might gain some performance with VMs, but I'm guessing from experience that in most cases you won't gain anything.

    If you're just getting started and testing with Hadoop, I think virtualizing is overkill. You can very easily run Hadoop in pseudo-distributed mode, which means that all the Hadoop daemons run on the same box, each as a separate process. That's what I used to get started with Hadoop, and it gives you a good head start. The single-node setup guide in the Hadoop documentation covers this (pick the page that matches the Hadoop version you're running).

    If you get to the point where you want to test with a real cluster but don't have the resources, I would advise looking at Amazon Elastic MapReduce (EMR): it gives you a cluster on demand and it's pretty cheap. That way you can do more advanced tests. The Amazon EMR documentation has more details.

    The bottom line is that if the purpose is simply testing, you don't really need a virtual cluster.
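
To make the pseudo-distributed option mentioned above a bit more concrete, here is a minimal sketch in Python that generates the two small config files such a setup usually needs. It assumes Hadoop 2.x or later, where the properties are named `fs.defaultFS` and `dfs.replication` (Hadoop 1.x used `fs.default.name` instead), and it assumes the NameNode listens on `localhost:9000`, which is only a common choice, not a requirement. The helper `build_site_xml` is a hypothetical name used for illustration and is not part of Hadoop.

```python
from xml.sax.saxutils import escape


def build_site_xml(properties):
    """Render a dict of property name -> value as Hadoop *-site.xml text.

    Hypothetical helper for illustration only; not part of Hadoop.
    """
    lines = ['<?xml version="1.0"?>', "<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(str(value))}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines) + "\n"


# Assumption: Hadoop 2.x+ property names; localhost:9000 is just a common choice.
CORE_SITE = {"fs.defaultFS": "hdfs://localhost:9000"}

# With a single DataNode on the box there is nothing to replicate to,
# so the replication factor is set to 1.
HDFS_SITE = {"dfs.replication": 1}

if __name__ == "__main__":
    # Print the files so they can be reviewed before being copied
    # into the Hadoop configuration directory.
    print("core-site.xml:\n" + build_site_xml(CORE_SITE))
    print("hdfs-site.xml:\n" + build_site_xml(HDFS_SITE))
```

The output would go into `core-site.xml` and `hdfs-site.xml` in your Hadoop configuration directory. After that, the daemons still have to be formatted and started as described in the setup guide for your version, which this sketch does not cover.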