Tags: java, performance, mahout, recommendation-engine

Mahout: (fast performance) how to write preferences to file?


I have a training dataset of 800,000 records from 6,000 users rating 3,900 movies. These are stored in a comma-separated file with lines of the form userId,movieId,preference. I have another dataset (200,000 records) in the format userId,movieId. My goal is to use the first dataset as a training set in order to estimate the missing preferences of the second set.

So far, I have managed to load the training dataset and generate user-based recommendations. This goes smoothly and doesn't take much time. But I'm struggling when it comes to writing the recommendations back to a file.

The first method I tried is:

  • read a line from the file and get the userId,movieId tuple.
  • retrieve the calculated preference with estimatePreference(userId, movieId)
  • append the preference to the line and save it in a new file
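
A minimal sketch of these three steps, under stated assumptions: the `Estimator` interface is a hypothetical stand-in for the call to `recommender.estimatePreference(userId, movieId)`, and the class and method names are illustrative only. It streams the input through a buffered reader and writer, so the file handling itself costs very little; whatever time remains is spent inside the estimator.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;

class PreferenceWriter {

    /** Hypothetical stand-in for recommender.estimatePreference(userId, movieId). */
    @FunctionalInterface
    interface Estimator {
        float estimate(long userId, long movieId) throws Exception;
    }

    /**
     * Reads "userId,movieId" lines from in, appends the estimated preference
     * to each one, and writes the result to out. Returns the number of lines written.
     */
    static int appendPreferences(Reader in, Writer out, Estimator estimator) {
        int written = 0;
        try (BufferedReader reader = new BufferedReader(in);
             BufferedWriter writer = new BufferedWriter(out)) {
            String line;
            while ((line = reader.readLine()) != null) {
                line = line.trim();
                if (line.isEmpty()) {
                    continue; // skip blank lines
                }
                String[] parts = line.split(",");
                long userId = Long.parseLong(parts[0].trim());
                long movieId = Long.parseLong(parts[1].trim());
                float preference = estimator.estimate(userId, movieId);
                writer.write(line + "," + preference);
                writer.newLine();
                written++;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (Exception e) {
            throw new IllegalStateException("Failed to process input", e);
        }
        return written;
    }
}
```

With Mahout on the classpath, the method reference `recommender::estimatePreference` should fit the `Estimator` parameter, for example `appendPreferences(new FileReader("test.csv"), new FileWriter("out.csv"), recommender::estimatePreference)`, where both file names are placeholders.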

This works, but it's incredibly slow. I added a counter that prints every 10,000th iteration, and after a couple of minutes it had printed only once. I have 8 GB of RAM and an i7 processor... how long can it take to process 200,000 lines?!

My second attempt was:

  • create a new FileDataModel with the second dataset
  • do something like this:

    newDataModel.setPreference(userId, movieId, recommender.estimatePreference(userId, movieId));

Here I get several problems:

  1. At runtime I get a java.lang.UnsupportedOperationException (as I found here, FileDataModel actually can't be updated; I don't know why the method setPreference exists in the first place...)
  2. The API documentation of setPreference states "This method should also be considered relatively slow."

I have read that one solution would be to use delta files, but I couldn't find out what that actually means. Any suggestions on how I could speed up writing the preferences?

Note that I'm new to Mahout and to recommender systems, so please use layman terms ;)


Solution

  • Are you sure that the problem is writing the results? It seems to me that the real problem is the use of a user-based recommender.

    For such a small dataset, a search-based recommender, for instance, will be able to make a recommendation in less than a millisecond, with multiple recommendations possible in parallel. This should allow you to do 200,000 recommendations in a few minutes on a single machine.

    With such a small dataset, indicator-based methods may not be the best option. To improve on that, try using something larger, such as the Million Song Dataset. See http://labrosa.ee.columbia.edu/millionsong/

    Also, using and estimating ratings is not a particularly good approach if you want to build a real recommender.

    Finally, questions about Mahout are much better addressed to the Mahout mailing list itself.
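
To put the timing estimate above in numbers: at one millisecond per recommendation, 200,000 recommendations take about 200 seconds (a little over three minutes) on a single thread, and less when they run in parallel. The following is a minimal sketch of the parallel part only, not of a search-based recommender. The `Estimator` interface is a hypothetical stand-in for whatever call produces one estimate, and the sketch assumes that call is safe to invoke from several threads at once.

```java
import java.util.List;
import java.util.stream.Collectors;

class ParallelEstimates {

    /** Hypothetical stand-in for the call that estimates one preference; assumed thread-safe. */
    @FunctionalInterface
    interface Estimator {
        float estimate(long userId, long movieId);
    }

    /**
     * Turns each "userId,movieId" line into "userId,movieId,preference".
     * The work is spread over the common fork-join pool; the result list
     * keeps the order of the input lines.
     */
    static List<String> estimateAll(List<String> lines, Estimator estimator) {
        return lines.parallelStream()
                .map(line -> {
                    String[] parts = line.split(",");
                    long userId = Long.parseLong(parts[0].trim());
                    long movieId = Long.parseLong(parts[1].trim());
                    return line + "," + estimator.estimate(userId, movieId);
                })
                .collect(Collectors.toList());
    }
}
```

The input lines could be loaded with `java.nio.file.Files.readAllLines` and the result written out with `Files.write`; a file of 200,000 short lines fits comfortably in memory.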