I'm coding a small recommendation system for a school project, where I'm supposed to look for similarity between items according to user behavior. I've already tried Mahout, and what's really impressive is how fast it computes the similarity between two users or items, even on very large datasets. I searched the Mahout in Action book without finding an exact answer: the authors give the names of the classes used, but not the mechanisms behind them. So I tried following the same data representation, but calculating the similarity between two items is very time-consuming. For each item I have an int array of the IDs of the users who rated it and a matching float array of their scores. I used hashing to find the intersection of the two user ID arrays and compute a Euclidean similarity quickly, as shown in the code below, but without any success. I need help, please :(
Item item1 = dataModel.getItem(item1_ID);
Item item2 = dataModel.getItem(item2_ID);
int[] i1_users = item1.getUsersId();
int[] i2_users = item2.getUsersId();
float[] i1_scores = item1.getScore();
float[] i2_scores = item2.getScore();
float dist = 0f;
// Index item1's scores by user ID
IntFloatOpenHashMap tempHash = new IntFloatOpenHashMap();
for (int i = 0; i < i1_users.length; i++) {
    tempHash.put(i1_users[i], i1_scores[i]);
}
// Accumulate squared differences over users who rated both items
for (int i = 0; i < i2_users.length; i++) {
    if (tempHash.containsKey(i2_users[i])) {
        float diff = tempHash.get(i2_users[i]) - i2_scores[i];
        dist += diff * diff;
    }
}
// return Math.sqrt(dist);
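Regardless of what you are trying to calculate (item similarity or user similarity) based on preference values, you can improve the speed of your code. Your code builds and fills a hash map on every call, which adds allocation and hashing overhead on top of the comparison itself, whereas Mahout does the comparison in a single O(N) pass over the two arrays.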
You can check the implementation:
userSimilarity
-- https://github.com/apache/mahout/blob/mahout-0.9/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/AbstractSimilarity.java#L110
itemSimilarity
-- https://github.com/apache/mahout/blob/mahout-0.9/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/AbstractSimilarity.java#L225
Basically you can iterate the two user id arrays in parallel, and also you can avoid creating tempHash. You gain both in terms of space and time. I hope that helps.
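To make the parallel iteration concrete, here is a minimal sketch of the idea. It is not Mahout's code. It assumes each user ID array is sorted in ascending order with no duplicates, and that each score array is aligned with its ID array; if your data is not stored that way, you would have to sort it once when loading. The class and method names (SortedSimilarity, euclideanDistance) are made up for this illustration.

```java
final class SortedSimilarity {

    private SortedSimilarity() {}

    /**
     * Euclidean distance over the users who rated both items.
     * Assumption: users1 and users2 are sorted ascending without duplicates,
     * and scores1[k] / scores2[k] belong to users1[k] / users2[k].
     */
    static double euclideanDistance(int[] users1, float[] scores1,
                                    int[] users2, float[] scores2) {
        int i = 0;
        int j = 0;
        double sum = 0.0;
        // Walk both arrays at once, always advancing the side with the smaller ID.
        while (i < users1.length && j < users2.length) {
            if (users1[i] == users2[j]) {
                // Same user rated both items: accumulate the squared difference.
                double diff = scores1[i] - scores2[j];
                sum += diff * diff;
                i++;
                j++;
            } else if (users1[i] < users2[j]) {
                i++;
            } else {
                j++;
            }
        }
        return Math.sqrt(sum);
    }
}
```

Each element of either array is visited at most once, so the cost is linear in the combined length of the two arrays, and no temporary map is allocated per call. With your variables, the call would look like SortedSimilarity.euclideanDistance(i1_users, i1_scores, i2_users, i2_scores).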