mahout recommendation-engine mahout-recommender

What's difference between item-based and content-based collaborative filtering?

I am puzzled about what the item-based recommendation is, as described in the book "Mahout in Action". There is the algorithm in the book:

for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average

How can I calculate the similarity between items? If using the content, isn't it a content-based recommendation?

Solution

Item-Based Collaborative Filtering

The original Item-based recommendation is totally based on user-item ranking (e.g., a user rated a movie with 3 stars, or a user "likes" a video). When you compute the similarity between items, you are not supposed to know anything other than all users' history of ratings. So the similarity between items is computed based on the ratings instead of the meta data of item content.

Let me give you an example. Suppose you have only access to some rating data like below:

user 1 likes: movie, cooking
user 2 likes: movie, biking, hiking
user 3 likes: biking, cooking
user 4 likes: hiking

Suppose now you want to make recommendations for user 4.

First you create an inverted index for items, you will get:

movie:     user 1, user 2
cooking:   user 1, user 3
biking:    user 2, user 3
hiking:    user 2, user 4

Since this is a binary rating (like or not), we can use a similarity measure like Jaccard Similarity to compute item similarity.

                                 |user1|
similarity(movie, cooking) = --------------- = 1/3
                               |user1,2,3|

In the numerator, user1 is the only element that movie and cooking both has. In the denominator the union of movie and cooking has 3 distinct users (user1,2,3). |.| here denote the size of the set. So we know the similarity between movie and cooking is 1/3 in our case. You just do the same thing for all possible item pairs (i,j).

After you are done with the similarity computation for all pairs, say, you need to make a recommendation for user 4.

Look at the similarity score of similarity(hiking, x) where x is any other tags you might have.

If you need to make a recommendation for user 3, you can aggregate the similarity score from each items in its list. For example,

score(movie)  = Similarity(biking, movie) + Similarity(cooking, movie)
score(hiking) = Similarity(biking, hiking) + Similarity(cooking, hiking)

Content-Based Recommendation

The point of content-based is that we have to know the content of both user and item. Usually you construct user-profile and item-profile using the content of shared attribute space. For example, for a movie, you represent it with the movie stars in it and the genres (using a binary coding for example). For user profile, you can do the same thing based on the users likes some movie stars/genres etc. Then the similarity of user and item can be computed using e.g., cosine similarity.

Here is a concrete example:

Suppose this is our user-profile (using binary encoding, 0 means not-like, 1 means like), which contains user's preference over 5 movie stars and 5 movie genres:

         Movie stars 0 - 4    Movie Genres
user 1:    0 0 0 1 1          1 1 1 0 0
user 2:    1 1 0 0 0          0 0 0 1 1
user 3:    0 0 0 1 1          1 1 1 1 0

Suppose this is our movie-profile:

         Movie stars 0 - 4    Movie Genres
movie1:    0 0 0 0 1          1 1 0 0 0
movie2:    1 1 1 0 0          0 0 1 0 1
movie3:    0 0 1 0 1          1 0 1 0 1

To calculate how good a movie is to a user, we use cosine similarity:

                                 dot-product(user1, movie1)
similarity(user 1, movie1) = --------------------------------- 
                                   ||user1|| x ||movie1||

                              0x0+0x0+0x0+1x0+1x1+1x1+1x1+1x0+0x0+0x0
                           = -----------------------------------------
                                         sqrt(5) x sqrt(3)

                           = 3 / (sqrt(5) x sqrt(3)) = 0.77460

Similarly:

similarity(user 2, movie2) = 3 / (sqrt(4) x sqrt(5)) = 0.67082 
similarity(user 3, movie3) = 3 / (sqrt(6) x sqrt(5)) = 0.54772

If you want to give one recommendation for user i, just pick movie j that has the highest similarity(i, j).