mahout amazon-emr

Vectorizing a Solr index with Mahout using lucene.vector


I'm trying to run a clustering job on Amazon EMR using Mahout. I have a Solr index that I uploaded to S3, and I want to vectorize it using Mahout's lucene.vector program. This is the first step in the job flow.

The parameters for the step are:

  • Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
  • MainClass: org.apache.mahout.driver.MahoutDriver
  • Args: lucene.vector --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors

The error in the log is:

Unknown program 'lucene.vector' chosen.

I've done the same process locally with Hadoop and Mahout and it worked fine. How should I call lucene.vector on EMR?


Solution

  • I eventually figured out the answer. The problem was that I was using the wrong MainClass argument. Instead of

    org.apache.mahout.driver.MahoutDriver
    

    I should have used:

    org.apache.mahout.utils.vectors.lucene.Driver
    

    Therefore, the correct step parameters are as follows (note that lucene.vector is no longer passed as the first argument):

    • Jar: s3n://mahout-bucket/jars/mahout-core-0.6-job.jar
    • MainClass: org.apache.mahout.utils.vectors.lucene.Driver
    • Args: --dir s3n://mahout-input/solr_index/ --field name --dictOut /test/solr-dict-out/dict.txt --output /test/solr-vectors-out/vectors
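
    For anyone scripting the job flow instead of filling in the step by hand, the corrected parameters could be assembled roughly like this. This is only a sketch: the function name build_lucene_vector_step and the plain dict layout are my own illustration and not part of Mahout or any EMR SDK, and how the dict gets submitted (console, CLI, or an SDK) is left out. The small guard just catches the original mistake of passing the lucene.vector program name when the driver class is called directly.

    ```python
    # Hypothetical helper: builds the three step parameters (Jar, MainClass, Args)
    # as a plain dict. Nothing here talks to EMR; it only assembles the values.

    LUCENE_DRIVER = "org.apache.mahout.utils.vectors.lucene.Driver"


    def build_lucene_vector_step(jar, index_dir, field, dict_out, output):
        """Return the step parameters for vectorizing a Lucene/Solr index."""
        args = [
            "--dir", index_dir,     # location of the Solr/Lucene index
            "--field", field,       # index field to vectorize
            "--dictOut", dict_out,  # where the dictionary file is written
            "--output", output,     # where the vectors are written
        ]
        # When the driver class is the MainClass, the program name used with
        # MahoutDriver must not appear in the arguments.
        if "lucene.vector" in args:
            raise ValueError("do not pass 'lucene.vector' when calling the driver class directly")
        return {"Jar": jar, "MainClass": LUCENE_DRIVER, "Args": args}


    # Values taken from the step parameters listed above.
    step = build_lucene_vector_step(
        jar="s3n://mahout-bucket/jars/mahout-core-0.6-job.jar",
        index_dir="s3n://mahout-input/solr_index/",
        field="name",
        dict_out="/test/solr-dict-out/dict.txt",
        output="/test/solr-vectors-out/vectors",
    )
    print(step["MainClass"])
    print(" ".join(step["Args"]))
    ```

    Keeping the arguments as a list rather than one long string avoids quoting problems if a path ever contains spaces.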