Search code examples
mahoutmahout-recommender

How mahout's recommendation evaluator works


Can anyone tell me how does mahout's RecommenderIRStatsEvaluator work? More specifically how it randomly splits training and testing data and what data the result is compare against? Based on my understating, you need some sort of ideal/expected result which you need to compare against actual result from the recommendation algorithm to find out TP or FP and thus compute precision or recall. But it looks like mahout provides a precision/recall score without that ideal/result.


Solution

  • The data is split into training and testing set using some relevance threshold value which you supply in the evaluate method of the RecommenderIRStatsEvaluator class. If this values is null there is method that computes it (computeThreshold). The class that splits the data into training and testing is GenericRelevantItemsDataSplitter. If you take a look into the code you can see that first the preferences for each user are sorted according the the value in descending order, and than only those that have value bigger than the relevanceThreshold are taken as relevant. Also notice that at most at are put into this set.

    @Override
      public FastIDSet getRelevantItemsIDs(long userID,
                                           int at,
                                           double relevanceThreshold,
                                           DataModel dataModel) throws TasteException {
        PreferenceArray prefs = dataModel.getPreferencesFromUser(userID);
        FastIDSet relevantItemIDs = new FastIDSet(at);
        prefs.sortByValueReversed();
        for (int i = 0; i < prefs.length() && relevantItemIDs.size() < at; i++) {
          if (prefs.getValue(i) >= relevanceThreshold) {
            relevantItemIDs.add(prefs.getItemID(i));
          }
        }
        return relevantItemIDs;
      }
    

    How the precision and the recall are computed you can see in the RecommenderIRStatsEvaluator.evaluate method. In short it is like this: First only one user is evaluated at a time. His preference values are split into relevant (as described above) and other. The relevant ones are used as test set, and the other together with all other users as training. Then top-at recommendations are produced for this user. Next the method looks whether some of the items that were taken aside as test set appear in the recommendation, and how many:

    int intersectionSize = 0;
          List<RecommendedItem> recommendedItems = recommender.recommend(userID, at, rescorer);
          for (RecommendedItem recommendedItem : recommendedItems) {
            if (relevantItemIDs.contains(recommendedItem.getItemID())) {
              intersectionSize++;
            }
      }
    

    The precision than is computed as follows:

    (double) intersectionSize / (double) numRecommendedItems
    

    Where numRecommendedItems is usually your at if the recommender produces at least at recommendations, otherwise smaller.

    Similar, the recall is computed as follows:

    (double) intersectionSize / (double) numRelevantItems
    

    where numRelevantItems is the number of items in the test set for this user.

    The final precision and recall are macro average of all precisions and recalls for all users.

    Hope this answers your question.

    EDIT: To continue with your question, it is very tricky when evaluating IR statistics (precision and recall) for recommender systems, especially if you have small number of user preferences per user. In this book you can find very useful insights. It says that

    it is typically assumed that the not liked items would have not been liked even if they had been recommended i.e they are uninteresting or useless for the user. However, this may not be true, because the set of not liked items may contains some interesting items that the user did not select. For example, a user may not have liked an item because he was unaware of its existence, but after the recommendation exposed that item, the user can decide to select it. In any case when using IR statistics, the number of the FP is over estimated.