Tags: hadoop, mahout, recommendation-engine, emr

Mahout - ParallelALSFactorizationJob running too long?


I am trying to run Mahout's ALS recommender on an AWS EMR cluster, but it is taking much longer than I expected.

The following is the command I run:

aws emr add-steps --cluster-id <cluster_id> \
              --steps Type=CUSTOM_JAR,\
                      Name="Mahout ALS Factorization Job",\
                      Jar=s3://<my_bucket>/recproto/mahout-mr-0.10.0-job.jar,\
                      MainClass=org.apache.mahout.cf.taste.hadoop.als.ParallelALSFactorizationJob,\
                      Args=["--input","s3://<my_bucket>/recproto/trainingdata/userClicks.csv.gz",\
                            "--output","s3://<my_bucket>/recproto/als-output/",\
                            "--implicitFeedback","true",\
                            "--lambda","150",\
                            "--alpha","0.05",\
                            "--numFeatures","100",\
                            "--numIterations","3",\
                            "--numThreadsPerSolver","4",\
                            "--usesLongIDs","true"]

In the userClicks.csv file, there are 1,567,808 ratings from 335,636 users and 23,934 items.

The job runs on an EMR cluster of 10 c3.xlarge nodes and has been running for more than 2 hours. Is this normal? For a rating file of this size, what cluster scale and parameters should I use to get a more acceptable running time?
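For a sense of scale, the figures quoted in the question can be turned into a rough size estimate. The sketch below is plain arithmetic on those numbers; the 8-bytes-per-value figure for the factor matrices is an assumption (double precision), not something stated in the question.

```python
# Counts taken from the question; numFeatures is the --numFeatures argument.
ratings, users, items, features = 1_567_808, 335_636, 23_934, 100

# Fraction of the user x item matrix that actually holds a rating.
density = ratings / (users * items)

# Average number of ratings per user.
avg_per_user = ratings / users

# Size of the two dense factor matrices (users x features, items x features),
# assuming 8 bytes per value (double precision, an assumption).
factor_bytes = (users + items) * features * 8

print(f"density: {density:.4%}")
print(f"ratings per user: {avg_per_user:.2f}")
print(f"factor matrices: {factor_bytes / 1e6:.1f} MB")
```

Under these assumptions the rating matrix is less than 0.02% dense and the learned factors come to roughly 288 MB, so the data itself is small by cluster standards.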


Solution

  • I solved this problem by simply using Spark ALS. Training takes LESS THAN 2 MINUTES ON MY LAPTOP on the same dataset with the same parameters.

    I can now understand why some machine learning algorithms are deprecated due to performance issues (e.g., the MinHash algorithm).
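The answer does not show its Spark code, so the following is only a minimal sketch of what an equivalent run might look like with PySpark's `pyspark.ml.recommendation.ALS`. Several things here are assumptions rather than facts from the post: the input is taken to be a headerless `user,item,clicks` CSV, the IDs are assumed to fit in 32-bit integers (the question uses `--usesLongIDs`, so real IDs may need remapping to integer indices first), and mapping Mahout's `--lambda` and `--alpha` directly onto `regParam` and `alpha` is not a verified equivalence. The function name `train_implicit_als` is hypothetical.

```python
# Hyperparameters copied from the Mahout command in the question. Mapping
# --lambda to regParam and --alpha to alpha is an assumption, not a verified
# equivalence between the two implementations.
ALS_PARAMS = {
    "rank": 100,            # --numFeatures
    "maxIter": 3,           # --numIterations
    "regParam": 150.0,      # --lambda
    "alpha": 0.05,          # --alpha
    "implicitPrefs": True,  # --implicitFeedback
}


def train_implicit_als(csv_path, params=ALS_PARAMS):
    """Fit Spark ML ALS on a headerless user,item,clicks CSV (assumed layout)."""
    # Imported lazily so this module loads even where PySpark is not installed.
    from pyspark.sql import SparkSession
    from pyspark.ml.recommendation import ALS

    # Local mode uses all cores of a single machine, as on a laptop.
    spark = (
        SparkSession.builder.master("local[*]")
        .appName("als-sketch")
        .getOrCreate()
    )

    # Column names and types are assumptions; IDs must fit in an int for ALS.
    ratings = spark.read.csv(
        csv_path, schema="user INT, item INT, clicks FLOAT"
    )

    als = ALS(userCol="user", itemCol="item", ratingCol="clicks", **params)
    return als.fit(ratings)
```

Calling `train_implicit_als("userClicks.csv.gz")` would return a fitted `ALSModel`, whose `userFactors` and `itemFactors` DataFrames hold the learned factors. Running in `local[*]` mode keeps everything on one machine, which is the setup the answer describes.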