hadoop docker cassandra cloudera

Is Hadoop in a Docker container faster/worth it?


I have a Hadoop-based environment in which I use Flume, Hue, and Cassandra. There is a lot of hype around Docker nowadays, so I would like to examine the pros and cons of dockerizing this setup. I think it would be much more portable, but the environment can already be set up with a few clicks using Cloudera Manager. Is it maybe faster, or why is it worth it? What are the advantages? Or should only the multi-node Cassandra cluster be dockerized?


Solution

  • Is it maybe faster, or why is it worth it?

    It sounds like you already have a Hadoop cluster, so ask yourself two things: how long does it take to reproduce this environment, and how often do you need to reproduce it?

    If you do not need a way to reproduce the environment repeatedly, or to contain dependencies that may conflict with other applications on the host, then I don't yet see a use case for you.

    What are advantages?

    If you are running Hadoop in an environment where you may need mixed Java versions, then running it in a container could isolate the dependencies (in this case, Java) from the host system. In some cases, it would also give you a more easily reproducible artifact to move around and set up. But Java apps are already fairly simple to ship when all their dependencies are included in the JAR.
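To make the mixed-Java-versions point concrete, here is a minimal sketch in Python that only builds (and prints) `docker run` commands for two jobs pinned to different Java runtimes. The image tags (`eclipse-temurin:8-jre`, `eclipse-temurin:17-jre`), the job names, and the JAR paths are assumptions chosen for illustration, not part of the original setup; nothing here actually starts a container.

```python
import os
import shlex


def docker_java_command(image, jar_path, container_dir="/app"):
    """Build a `docker run` argument list that runs a JAR with the Java in `image`.

    `jar_path` is expected to be an absolute host path, because Docker bind
    mounts (`-v host:container`) need an absolute host directory.
    """
    host_dir = os.path.dirname(jar_path)
    jar_name = os.path.basename(jar_path)
    return [
        "docker", "run", "--rm",                  # remove the container when it exits
        "-v", f"{host_dir}:{container_dir}:ro",   # mount the JAR's directory read-only
        "-w", container_dir,                      # run from the mounted directory
        image,                                    # the image supplies the Java version
        "java", "-jar", jar_name,
    ]


# Hypothetical jobs: each one is pinned to its own Java runtime image,
# independent of whatever Java (if any) is installed on the host.
jobs = {
    "legacy-job": ("eclipse-temurin:8-jre", "/opt/jobs/legacy/legacy-job.jar"),
    "new-job": ("eclipse-temurin:17-jre", "/opt/jobs/new/new-job.jar"),
}

for name, (image, jar) in jobs.items():
    print(name, "->", shlex.join(docker_java_command(image, jar)))
```

The host needs only Docker; each job carries its Java version in the image name, so two jobs with conflicting runtime requirements can sit side by side.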

    Maybe only the multi-node Cassandra cluster should be dockerized?

    I don't think it really comes down to whether it is a multi-node environment or not. It comes down to the problems Docker solves. It doesn't sound like you have any pain points in deploying or reproducing Hadoop environments (yet), so I don't see the need to "dockerize" something just because it is the hot new thing on the block.

    When you do need to reproduce the Hadoop environment easily, you might look at Docker for the orchestration and management tools built around it (Kubernetes, Rancher, etc.), which make deploying and managing clusters of applications on an overlay network much more appealing than plain Docker alone. In my eyes, Docker is just the tool. It really starts to shine when you can take advantage of the multi-host overlay networking, service discovery, and orchestration that other projects are building on top of it.