I'm trying to run a clustering job on Amazon EMR using Mahout. I have a Solr index that I uploaded to S3, and I want to vectorize it using Mahout's lucene.vector (this is the first step in the job flow).
The parameters for the step are:
The error in the log is:
Unknown program 'lucene.vector' chosen.
I've run the same process locally with Hadoop and Mahout and it worked fine. How should I call the lucene.vector program on EMR?
I eventually figured out the answer. The problem was that I was using the wrong MainClass argument. Instead of
org.apache.mahout.driver.MahoutDriver
I should have used:
org.apache.mahout.utils.vectors.lucene.Driver
Therefore the correct arguments should have been
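
As for why the main class matters: MahoutDriver treats its first argument as a short program name and looks it up in a registry of known programs. The error message suggests that lucene.vector was not in that registry on the EMR installation, though that is my assumption and not something I verified. Naming the tool's own driver class as the main class skips the lookup entirely. Here is a toy Python model of that lookup, just to illustrate the idea. It is not Mahout's code, and the registry contents and function names are hypothetical:

```python
# Toy model of short-name dispatch. This is NOT Mahout's code; the registry
# below is a hypothetical stand-in for whatever the installed build registers.
REGISTRY = {
    "kmeans": "org.apache.mahout.clustering.kmeans.KMeansDriver",
    "canopy": "org.apache.mahout.clustering.canopy.CanopyDriver",
}


def resolve(program, registry=REGISTRY):
    """Return the class registered for a short program name."""
    if program not in registry:
        # Same wording as the error quoted in the question.
        raise ValueError(f"Unknown program '{program}' chosen.")
    return registry[program]


def describe(program):
    """Return the resolved class name, or the error text if lookup fails."""
    try:
        return resolve(program)
    except ValueError as err:
        return str(err)
```

In this model, describe("lucene.vector") returns the error text because the name is missing from the registry, while a step that names org.apache.mahout.utils.vectors.lucene.Driver as its main class never goes through the lookup at all.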