
Apache Mahout Performance Issues


I have been working with Mahout for the past few days, trying to build a recommendation engine. The project I'm working on has the following data:

  • 12M users
  • 2M items
  • 18M user-item boolean recommendations

    I am now experimenting with 1/3 of the full data set (i.e. 6M out of 18M recommendations). With every configuration I tried, Mahout gave quite disappointing results: some recommendations took 1.5 seconds while others took over a minute. I think a reasonable time for a single recommendation would be around 100ms.

    Why is Mahout so slow?
    I'm running the application on Tomcat with the following JVM arguments (although adding them didn't make much of a difference):

    -Xms4096M -Xmx4096M -da -dsa -XX:NewRatio=9 -XX:+UseParallelGC -XX:+UseParallelOldGC
    

    Below are code snippets for my experiments:

    User similarity 1:

    DataModel model = new FileDataModel(new File(dataFile));
    UserSimilarity similarity = new CachingUserSimilarity(new LogLikelihoodSimilarity(model), model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, Double.NEGATIVE_INFINITY, similarity, model, 0.5);
    recommender = new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
    

    User similarity 2:

    DataModel model = new FileDataModel(new File(dataFile));
    UserSimilarity similarity = new CachingUserSimilarity(new LogLikelihoodSimilarity(model), model);
    UserNeighborhood neighborhood = new CachingUserNeighborhood(new NearestNUserNeighborhood(10, similarity, model), model);
    recommender = new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);
    

    Item similarity 1:

    DataModel dataModel = new FileDataModel(new File(dataFile));
    ItemSimilarity itemSimilarity = new LogLikelihoodSimilarity(dataModel);
    recommender = new GenericItemBasedRecommender(dataModel, itemSimilarity);
    

    Solution

  • With the gracious help of the Mahout community via its mailing list, we found a solution to my problem. All of the code related to the solution was committed into Mahout 0.6. More details can be found in the corresponding JIRA ticket.

    Using VisualVM, I found that the performance bottleneck was in the computation of item-item similarities. This was addressed by @Sean using a very simple but effective fix (see the SVN commit for more details).
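
To give a feel for why item-item similarity can dominate the running time, here is a minimal sketch of a log-likelihood-ratio similarity for boolean data, written in plain Java without Mahout. The class and method names (`LlrSketch`, `logLikelihoodRatio`, `similarity`) are hypothetical, and the mapping of the ratio into the 0..1 range is my understanding of how such scores are commonly normalized, not a copy of Mahout's code. Each call needs the co-occurrence counts for one pair of items, and with millions of items the number of pairs to evaluate per request grows quickly.

```java
// Hypothetical illustration only; not Mahout's LogLikelihoodSimilarity.
final class LlrSketch {

    private static double xLogX(long x) {
        return x == 0 ? 0.0 : x * Math.log(x);
    }

    // Unnormalized Shannon entropy of a set of counts.
    private static double entropy(long... counts) {
        long sum = 0;
        double parts = 0.0;
        for (long c : counts) {
            parts += xLogX(c);
            sum += c;
        }
        return xLogX(sum) - parts;
    }

    // k11: users with both items, k12: users with only item A,
    // k21: users with only item B, k22: users with neither item.
    static double logLikelihoodRatio(long k11, long k12, long k21, long k22) {
        double rowEntropy = entropy(k11 + k12, k21 + k22);
        double columnEntropy = entropy(k11 + k21, k12 + k22);
        double matrixEntropy = entropy(k11, k12, k21, k22);
        // Clamp at zero to absorb floating-point round-off.
        return Math.max(0.0, 2.0 * (rowEntropy + columnEntropy - matrixEntropy));
    }

    // Assumed normalization: maps the ratio from [0, inf) into [0, 1).
    static double similarity(long k11, long k12, long k21, long k22) {
        double llr = logLikelihoodRatio(k11, k12, k21, k22);
        return 1.0 - 1.0 / (1.0 + llr);
    }
}
```

Two items whose users overlap exactly as often as chance predicts score close to 0, while items that always appear together score close to 1.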

    Additionally, we have discussed how to improve the SamplingCandidateItemsStrategy to allow finer control over the sampling rate.
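
The general idea behind sampling candidate items can be sketched without Mahout: cap the number of candidates scored per request by drawing a random subset. The class below is a simplified, hypothetical illustration. `CandidateSampler` and `maxCandidates` are names I made up, and this is not how Mahout's `SamplingCandidateItemsStrategy` is implemented (it samples at several stages while collecting candidates).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not Mahout's SamplingCandidateItemsStrategy.
final class CandidateSampler {
    private final int maxCandidates;
    private final Random random;

    CandidateSampler(int maxCandidates, long seed) {
        if (maxCandidates < 1) {
            throw new IllegalArgumentException("maxCandidates must be >= 1");
        }
        this.maxCandidates = maxCandidates;
        this.random = new Random(seed);
    }

    // Returns at most maxCandidates item IDs, drawn uniformly without replacement.
    List<Long> sample(List<Long> candidateItemIds) {
        if (candidateItemIds.size() <= maxCandidates) {
            return new ArrayList<>(candidateItemIds);
        }
        List<Long> shuffled = new ArrayList<>(candidateItemIds);
        Collections.shuffle(shuffled, random);
        return new ArrayList<>(shuffled.subList(0, maxCandidates));
    }
}
```

A lower cap trades recommendation quality for latency, which is why finer control over the sampling rate matters for large data sets.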

    Finally, I did some testing with my application with the aforementioned fixes. All the recommendations took less than 1.5 seconds with the overwhelming majority taking less than 500ms. Mahout could easily handle 100 recommendations per second (I did not try to stress it more than that).
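
Timings like these can be collected with a small harness along the following lines. This is only a sketch: `LatencyProbe` is a hypothetical helper, and the `Runnable` stands in for whatever call produces a single recommendation in your application. It uses the nearest-rank method for percentiles.

```java
import java.util.Arrays;

// Hypothetical timing helper; the Runnable stands in for one recommendation call.
final class LatencyProbe {

    // Runs the task `runs` times and returns the elapsed time of each run in milliseconds.
    static double[] measureMillis(Runnable task, int runs) {
        double[] elapsed = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            elapsed[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return elapsed;
    }

    // Nearest-rank percentile, with p in the range (0, 100].
    static double percentile(double[] samples, double p) {
        if (samples.length == 0 || p <= 0 || p > 100) {
            throw new IllegalArgumentException("need samples and 0 < p <= 100");
        }
        double[] sorted = samples.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }
}
```

Reporting the median alongside the maximum makes it easier to see whether slow requests are the norm or a few outliers.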