java hadoop scalability real-time mahout

Hadoop, Mahout real-time processing alternative


I intended to use Hadoop as a "computation cluster" in my project. However, I then read that Hadoop is not intended for real-time systems because of the overhead of starting a job. I'm looking for a solution that can be used this way: jobs that can easily be scaled across multiple machines but that do not require much input data. In addition, I want to run machine learning jobs in real time, e.g. applying a previously trained neural network.

What libraries/technologies can I use for this purpose?


Solution

  • You are right, Hadoop is designed for batch-type processing.

    Reading the question, I thought of the Storm framework, very recently open sourced by Twitter, which can be considered "Hadoop for real-time processing".

    Storm makes it easy to write and scale complex realtime computations on a cluster of computers, doing for realtime processing what Hadoop did for batch processing. Storm guarantees that every message will be processed. And it's fast — you can process millions of messages per second with a small cluster. Best of all, you can write Storm topologies using any programming language.

    (from: InfoQ post)

    However, I have not worked with it yet, so I really cannot say much about it in practice.

    Twitter Engineering Blog Post: http://engineering.twitter.com/2011/08/storm-is-coming-more-details-and-plans.html
    Github: https://github.com/nathanmarz/storm
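
To make the difference from batch jobs concrete, here is a minimal plain-Java sketch of the general idea: load the model once, keep a pool of workers running, and hand each small request to a worker that is already up, so no per-job startup cost is paid. This is not Storm's API and not anything from Hadoop or Mahout. `LowLatencyScorer`, `Model`, and `scoreAll` are hypothetical names made up for illustration, and the single-neuron "model" is only a stand-in for a previously trained network. It also runs on one machine only; distributing the workers across a cluster is the part a framework such as Storm is meant to handle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: none of these names come from Storm, Hadoop, or Mahout.
public class LowLatencyScorer {

    // Stand-in for a previously trained network: one neuron with fixed weights.
    public static final class Model {
        private final double[] weights;
        private final double bias;

        public Model(double[] weights, double bias) {
            this.weights = weights.clone();
            this.bias = bias;
        }

        // Weighted sum followed by a logistic activation.
        public double predict(double[] features) {
            double sum = bias;
            for (int i = 0; i < weights.length; i++) {
                sum += weights[i] * features[i];
            }
            return 1.0 / (1.0 + Math.exp(-sum));
        }
    }

    private final Model model;
    private final ExecutorService workers;

    // The model is loaded and the workers are started once, up front.
    public LowLatencyScorer(Model model, int workerCount) {
        this.model = model;
        this.workers = Executors.newFixedThreadPool(workerCount);
    }

    // Each request goes to an already running worker; no new job is launched.
    public Future<Double> submit(double[] features) {
        Callable<Double> task = () -> model.predict(features);
        return workers.submit(task);
    }

    public void shutdown() {
        workers.shutdown();
    }

    // Convenience helper: score several small inputs and return results in input order.
    public static List<Double> scoreAll(Model model, List<double[]> inputs, int workerCount) {
        LowLatencyScorer scorer = new LowLatencyScorer(model, workerCount);
        try {
            List<Future<Double>> pending = new ArrayList<>();
            for (double[] input : inputs) {
                pending.add(scorer.submit(input));
            }
            List<Double> results = new ArrayList<>();
            for (Future<Double> future : pending) {
                results.add(future.get()); // blocks until that request is scored
            }
            return results;
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            scorer.shutdown();
        }
    }

    public static void main(String[] args) {
        Model model = new Model(new double[] {0.5, -0.25}, 0.0);
        List<double[]> inputs = Arrays.asList(new double[] {0.0, 0.0}, new double[] {4.0, 0.0});
        System.out.println(scoreAll(model, inputs, 4));
    }
}
```

In a real service the scorer would be created once and kept alive for the lifetime of the process; `scoreAll` creates and shuts down its own pool only to keep the example self-contained.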