Tags: java, clojure, parallel-processing, distributed-computing

Distributed computing framework for Clojure/Java


I'm developing an application where I need to distribute a set of tasks across a potentially quite large cluster of different machines.

Ideally I'd like a very simple, idiomatic way to do this in Clojure, e.g. something like:

; create a clustered set of machines
(def my-cluster (new-cluster list-of-ip-addresses))

; define a task to be executed
(deftask my-task (my-function arg1 arg2))

; run a task 10000 times on the cluster
(def my-job (run-task my-cluster my-task {:repeat 10000}))

; do something with the results:
(some-function (get-results my-job))
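
For clarity, here's a trivial single-JVM stand-in for that API, built on plain futures. All the names are hypothetical and nothing is actually distributed; a real implementation would ship each invocation to one of the cluster machines instead of a local thread:

; purely illustrative local stand-in for the API sketched above
(defn new-cluster [ip-addresses]
  {:ips ip-addresses}) ; placeholder; a real version would open connections

(defmacro deftask [name & body]
  `(def ~name (fn [] ~@body))) ; a task is just a zero-argument function

(defn run-task [cluster task opts]
  (let [n (:repeat opts 1)]
    ; submit n invocations; a real version would farm these out to (:ips cluster)
    {:futures (doall (repeatedly n #(future (task))))}))

(defn get-results [job]
  (map deref (:futures job))) ; blocks until every invocation has finished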

Bonus points if it can do something like Map-Reduce on the cluster as well.

What's the best way to achieve something like this? Maybe I could wrap an appropriate Java library?

UPDATE:

Thanks for all the suggestions of Apache Hadoop. It looks like it might fit the bill, but it seems a bit like overkill, since I don't need a distributed data storage system like the one Hadoop uses (i.e. I don't need to process billions of records). Something more lightweight, focused purely on compute tasks, would be preferable if it exists.


Solution

  • Hadoop is the basis for almost all of the large-scale big-data work in the Clojure world these days, though there are better options than using Hadoop directly.

    Cascalog is a very popular front end:

        Cascalog is a tool for processing data on Hadoop with Clojure in a concise and
        expressive manner. Cascalog combines two cutting edge technologies in Clojure 
        and Hadoop and resurrects an old one in Datalog. Cascalog is high performance, 
        flexible, and robust.
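
    For a flavor of the API, here is the classic word-count query, lightly adapted from the Cascalog tutorial; assume src is any Cascalog generator that emits one-field tuples of text lines:

        (ns wordcount.example
          (:use cascalog.api)
          (:require [cascalog.ops :as c]))

        ; split a line of text into individual words
        (defmapcatop split [^String line]
          (seq (.split line "\\s+")))

        ; count the occurrences of each word across all lines in src
        (defn word-count [src]
          (?<- (stdout)
               [?word ?count]
               (src ?line)
               (split ?line :> ?word)
               (c/count ?count)))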
    

    Also check out Amit Rathor's swarmiji, a distributed worker framework built on top of RabbitMQ. It's less focused on data processing and more on distributing a fixed number of tasks to a pool of available computing power. (P.S. It's covered in his book, Clojure in Action.)
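
    swarmiji's own job-definition macros are documented in its README and in Clojure in Action, so rather than guess at them, here is a generic sketch of the underlying pattern (a shared work queue on RabbitMQ that any number of workers drain), written against the langohr RabbitMQ client; langohr is an assumption here, not part of swarmiji:

        (ns worker.example
          (:require [langohr.core :as rmq]
                    [langohr.channel :as lch]
                    [langohr.queue :as lq]
                    [langohr.basic :as lb]
                    [langohr.consumers :as lc]))

        ; producer: publish n task payloads onto a shared queue
        (defn enqueue-tasks [ch n]
          (lq/declare ch "tasks" {:durable false :auto-delete false})
          (dotimes [i n]
            (lb/publish ch "" "tasks" (str i))))

        ; consumer: run one of these on each machine in the cluster
        (defn start-worker [ch handle-task]
          (lc/subscribe ch "tasks"
                        (fn [_ _ ^bytes payload]
                          (handle-task (String. payload "UTF-8")))
                        {:auto-ack true}))

        (let [conn (rmq/connect)  ; connects to a broker on localhost by default
              ch   (lch/open conn)]
          (enqueue-tasks ch 10000)
          (start-worker ch println))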