java, mahout

Mahout: Normalizing UserSimilarity distances


I have the following (non-Hadoop) model:

DataModel data = new FileDataModel(new File("file.csv"));
UserSimilarity userSimilarity = new PearsonCorrelationSimilarity(data);
userSimilarity.setPreferenceInferrer(new AveragingPreferenceInferrer(data));
UserNeighborhood userNeighborhood = new NearestNUserNeighborhood(1, userSimilarity, data);

The value returned by userSimilarity is not normalized to a range such as [0,100], so to display it to end users I scale it against the similarity of the user's nearest neighbor:

double maxSim = userSimilarity.userSimilarity(userId1, userNeighborhood.getUserNeighborhood(userId1)[0]);
long finalSimilarity = Math.min(100, Math.max((int) Math.ceil(100 * userSimilarity.userSimilarity(userId1, userId2) / maxSim), 0));

I observed performance issues with this (several seconds per user). Is there another approach, or a quicker way to get min(similarity) = 0 and max(similarity) = 100 for each given user?


Solution

  • Your performance problem has nothing to do with your normalization, and everything to do with the rest of the calculation.

    I would not use AveragingPreferenceInferrer, by the way. It slows things down and rarely helps. You may also find it faster to simply loop over all users and compute each similarity yourself to find the most similar one. Computing a neighborhood of size 1 does about the same work but is a little slower.

    Pearson correlation is in [-1,1]. If you want it in a range of [0,100], simply use 50*(1+correlation).
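
As a rough illustration of the two suggestions above, here is a minimal sketch in plain Java with no Mahout dependency. The class and method names (SimilarityScaling, toPercent, mostSimilar) are hypothetical, and the LongToDoubleFunction argument is only a stand-in for a call such as userSimilarity.userSimilarity(userId1, otherId). Mapping NaN to 0 is an assumption about how you want to display pairs for which no correlation could be computed.

```java
import java.util.function.LongToDoubleFunction;

class SimilarityScaling {

    /** Maps a Pearson correlation in [-1, 1] linearly onto an integer score in [0, 100]. */
    public static int toPercent(double correlation) {
        if (Double.isNaN(correlation)) {
            return 0; // assumption: show "no similarity available" as 0
        }
        // Clamp to guard against tiny floating-point overshoot outside [-1, 1].
        double clamped = Math.max(-1.0, Math.min(1.0, correlation));
        return (int) Math.round(50.0 * (1.0 + clamped));
    }

    /**
     * Returns the candidate with the highest similarity, or -1 if none is comparable.
     * similarityTo is a hypothetical stand-in for the real similarity lookup.
     */
    public static long mostSimilar(long[] candidateIds, LongToDoubleFunction similarityTo) {
        long bestId = -1L;
        double bestSimilarity = Double.NEGATIVE_INFINITY;
        for (long id : candidateIds) {
            double s = similarityTo.applyAsDouble(id);
            if (!Double.isNaN(s) && s > bestSimilarity) {
                bestId = id;
                bestSimilarity = s;
            }
        }
        return bestId;
    }

    public static void main(String[] args) {
        System.out.println(toPercent(-1.0)); // lowest possible correlation
        System.out.println(toPercent(0.0));  // uncorrelated
        System.out.println(toPercent(1.0));  // highest possible correlation

        // Made-up similarities, for illustration only.
        long[] others = {2L, 3L, 4L};
        System.out.println(mostSimilar(others, id -> id == 3L ? 0.9 : 0.1));
    }
}
```

Unlike scaling by the nearest neighbor's similarity, this fixed mapping needs no neighborhood lookup per user, but it also means the most similar user will usually score below 100.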