I have set myself up with Amazon Elastic MapReduce in order to perform various standard machine learning tasks. I have used Python extensively for local machine learning in the past, and I do not know Java.
As far as I can tell, there are no well-developed Python libraries for distributed machine learning. Java, on the other hand, has Apache Mahout and the more recent Oryx from Cloudera.
Essentially it seems I have to choose between two options: slog through parallelising my own algorithms to use with Hadoop streaming (or one of the Python wrappers for Hadoop) until decent libraries exist, or jump ship to Java so that I can use Mahout/Oryx. There is a world of difference between writing your own MapReduce word count code and writing your own MapReduce SVM! Even with the help of great tutorials like this.
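To be concrete about the easy end of that spectrum: a Hadoop streaming word count is just a mapper and a reducer that read stdin and write tab-separated key/value pairs to stdout. A rough sketch of my own (the script name and the map/reduce switch are just my convention) looks like this:

    #!/usr/bin/env python
    # wordcount.py -- rough sketch of a Hadoop streaming word count.
    # Typical invocation (flags abbreviated):
    #   hadoop jar hadoop-streaming.jar -input in -output out \
    #       -mapper "wordcount.py map" -reducer "wordcount.py reduce" -file wordcount.py
    import sys

    def mapper():
        # Emit "word<TAB>1" for every word read from stdin
        for line in sys.stdin:
            for word in line.split():
                print("%s\t%d" % (word, 1))

    def reducer():
        # Streaming sorts mapper output by key, so equal words arrive as a run
        current_word, count = None, 0
        for line in sys.stdin:
            word, n = line.rsplit("\t", 1)
            if word != current_word:
                if current_word is not None:
                    print("%s\t%d" % (current_word, count))
                current_word, count = word, 0
            count += int(n)
        if current_word is not None:
            print("%s\t%d" % (current_word, count))

    if __name__ == "__main__":
        mapper() if sys.argv[1] == "map" else reducer()

Writing something like a distributed SVM at that level means hand-rolling the whole optimisation as a chain of jobs like this, which is the part I would rather not do myself.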
I don't know which is the wiser choice, so my question is:
A) Is there some Python library I have missed which would be useful? If not, do you know if there are any in development which will be useful in the near future?
B) If the answer to the above is no, would my time be better spent jumping ship to Java?
I do not know of any library that can be used natively in Python for machine learning on Hadoop, but an easy solution would be to use the jpype module, which basically allows you to interact with Java from within your Python code.
For example, you can start a JVM like this:
    from jpype import startJVM, getDefaultJVMPath

    # Classpath containing the Mahout/Hadoop jars you want to call -- adjust for your setup
    classpath = "/path/to/mahout/jars/*"

    jvm = None

    def start_jpype():
        global jvm
        if jvm is None:
            # Start the JVM once, with the jars above on its classpath
            cpopt = "-Djava.class.path={cp}".format(cp=classpath)
            startJVM(getDefaultJVMPath(), "-ea", cpopt)
            jvm = "started"
There is a very good tutorial on the topic here, which explains how to use KMeans clustering from your Python code using Mahout.
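To give a flavour of what that looks like, here is a very rough sketch of calling into Mahout through jpype's JPackage once the JVM above is running. Everything Mahout-specific in it (the KMeansDriver class, the command-line style flags passed to its main method, and the HDFS paths) is my assumption and depends on your Mahout version, so treat the details as placeholders and check them against the tutorial.

    from jpype import JPackage, shutdownJVM

    def run_kmeans():
        start_jpype()  # boots the JVM with the Mahout jars on the classpath (see above)

        # Navigate to the Mahout namespace through jpype's JPackage
        kmeans = JPackage("org").apache.mahout.clustering.kmeans

        # Assumption: KMeansDriver.main() accepts the same flags as the
        # "mahout kmeans" command line; flags and paths here are illustrative only
        kmeans.KMeansDriver.main([
            "-i", "hdfs:///user/me/vectors",           # input vectors
            "-c", "hdfs:///user/me/initial-clusters",  # initial cluster centres
            "-o", "hdfs:///user/me/kmeans-output",     # output directory
            "-x", "10",                                # max iterations
            "-cd", "0.01",                             # convergence delta
        ])

        shutdownJVM()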