Tags: mahout, recommendation-engine, mahout-recommender

What's the difference between item-based and content-based collaborative filtering?


I am puzzled about what item-based recommendation is, as described in the book "Mahout in Action". Here is the algorithm from the book:

for every item i that u has no preference for yet
  for every item j that u has a preference for
    compute a similarity s between i and j
    add u's preference for j, weighted by s, to a running average
return the top items, ranked by weighted average

How can I calculate the similarity between items? If it is computed from the item content, isn't that content-based recommendation?


Solution

  • Item-Based Collaborative Filtering

    Classic item-based recommendation relies entirely on user-item ratings (e.g., a user rated a movie 3 stars, or a user "likes" a video). When you compute the similarity between items, you are not supposed to know anything beyond the users' rating history. So the similarity between items is computed from the ratings, not from the metadata of the item content.

    Let me give you an example. Suppose you only have access to rating data like this:

    user 1 likes: movie, cooking
    user 2 likes: movie, biking, hiking
    user 3 likes: biking, cooking
    user 4 likes: hiking
    

    Suppose now you want to make recommendations for user 4.

    First, build an inverted index that maps each item to the users who like it:

    movie:     user 1, user 2
    cooking:   user 1, user 3
    biking:    user 2, user 3
    hiking:    user 2, user 4
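
A minimal sketch of this inversion step, assuming the likes are held in a plain dictionary of sets (the names `likes` and `build_inverted_index` are illustrative, not from Mahout):

```python
from collections import defaultdict

# Toy data from the example: each user maps to the set of items they like.
likes = {
    "user 1": {"movie", "cooking"},
    "user 2": {"movie", "biking", "hiking"},
    "user 3": {"biking", "cooking"},
    "user 4": {"hiking"},
}


def build_inverted_index(likes):
    """Map each item to the set of users who like it."""
    index = defaultdict(set)
    for user, items in likes.items():
        for item in items:
            index[item].add(user)
    return dict(index)


inverted_index = build_inverted_index(likes)
for item in sorted(inverted_index):
    print(item, sorted(inverted_index[item]))
```

Storing the users as sets keeps the later intersection and union operations cheap and direct.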
    

    Since this is a binary rating (like or not), we can use a similarity measure like Jaccard Similarity to compute item similarity.

                                     |user1|
    similarity(movie, cooking) = --------------- = 1/3
                                   |user1,2,3|
    

    In the numerator, user1 is the only user that movie and cooking have in common. In the denominator, the union of movie and cooking contains 3 distinct users (user1,2,3). |.| here denotes the size of the set. So the similarity between movie and cooking is 1/3 in our case. You do the same thing for all possible item pairs (i,j).
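
The pairwise computation can be sketched as follows. This is only an illustration on the toy data, assuming the inverted index is a dictionary of user-id sets; `item_users` and `jaccard` are hypothetical names, not a Mahout API:

```python
from itertools import combinations

# Inverted index from the example: item -> set of users who like it.
item_users = {
    "movie": {1, 2},
    "cooking": {1, 3},
    "biking": {2, 3},
    "hiking": {2, 4},
}


def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Similarity for every unordered item pair, keyed by the alphabetically sorted pair.
similarity = {
    (i, j): jaccard(item_users[i], item_users[j])
    for i, j in combinations(sorted(item_users), 2)
}

for pair, score in similarity.items():
    print(pair, round(score, 3))
```

On this data every pair that shares one user scores 1/3, while cooking and hiking share no users and score 0.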

    Once the similarities for all pairs are computed, suppose you need to make a recommendation for user 4.

    • Look at the similarity scores similarity(hiking, x), where x is any other item, and recommend the items with the highest scores.

    If you need to make a recommendation for user 3, who likes more than one item, you can aggregate the similarity scores over the items in that user's list. For example,

    score(movie)  = Similarity(biking, movie) + Similarity(cooking, movie)
    score(hiking) = Similarity(biking, hiking) + Similarity(cooking, hiking) 
    

  • Content-Based Recommendation

    The point of content-based recommendation is that we have to know the content of both the user and the item. Usually you construct a user profile and an item profile over a shared attribute space. For example, you can represent a movie by the movie stars in it and its genres (using a binary encoding, for example). You can build the user profile the same way, based on which movie stars, genres, etc. the user likes. The similarity between a user and an item can then be computed using, e.g., cosine similarity.

    Here is a concrete example:

    Suppose these are our user profiles (binary encoding: 0 means does not like, 1 means likes), which hold each user's preferences over 5 movie stars and 5 movie genres:

             Movie stars 0 - 4    Movie genres 0 - 4
    user 1:    0 0 0 1 1          1 1 1 0 0
    user 2:    1 1 0 0 0          0 0 0 1 1
    user 3:    0 0 0 1 1          1 1 1 1 0
    

    Suppose these are our movie profiles:

             Movie stars 0 - 4    Movie genres 0 - 4
    movie1:    0 0 0 0 1          1 1 0 0 0
    movie2:    1 1 1 0 0          0 0 1 0 1
    movie3:    0 0 1 0 1          1 0 1 0 1
    

    To calculate how well a movie matches a user, we use cosine similarity:

                                     dot-product(user1, movie1)
    similarity(user 1, movie1) = --------------------------------- 
                                       ||user1|| x ||movie1||
    
                                  0x0+0x0+0x0+1x0+1x1+1x1+1x1+1x0+0x0+0x0
                               = -----------------------------------------
                                             sqrt(5) x sqrt(3)
    
                               = 3 / (sqrt(5) x sqrt(3)) = 0.77460
    

    Similarly:

    similarity(user 2, movie2) = 3 / (sqrt(4) x sqrt(5)) = 0.67082 
    similarity(user 3, movie3) = 3 / (sqrt(6) x sqrt(5)) = 0.54772
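
These three numbers can be checked with a short script. It is a sketch that assumes each profile is a flat list of 10 binary values (5 movie stars followed by 5 genres); the `cosine` helper is written out by hand rather than taken from a library:

```python
import math

# Profiles from the tables: 5 movie-star flags followed by 5 genre flags.
users = {
    "user 1": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    "user 2": [1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
    "user 3": [0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
}
movies = {
    "movie1": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "movie2": [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
    "movie3": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


print(round(cosine(users["user 1"], movies["movie1"]), 5))
print(round(cosine(users["user 2"], movies["movie2"]), 5))
print(round(cosine(users["user 3"], movies["movie3"]), 5))
```

Each pair has a dot product of 3, so the differences come only from the vector norms.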
    

    If you want to give one recommendation to user i, just pick the movie j with the highest similarity(i, j).
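
As a sketch of that last step, assuming the same flat binary profiles as in the tables, you can score every movie for every user and keep the best one. The names `cosine` and `best_movie` are illustrative:

```python
import math

# Profiles from the tables: 5 movie-star flags followed by 5 genre flags.
users = {
    "user 1": [0, 0, 0, 1, 1, 1, 1, 1, 0, 0],
    "user 2": [1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
    "user 3": [0, 0, 0, 1, 1, 1, 1, 1, 1, 0],
}
movies = {
    "movie1": [0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    "movie2": [1, 1, 1, 0, 0, 0, 0, 1, 0, 1],
    "movie3": [0, 0, 1, 0, 1, 1, 0, 1, 0, 1],
}


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def best_movie(user_profile, movies):
    """Return the name of the movie with the highest cosine similarity."""
    return max(movies, key=lambda name: cosine(user_profile, movies[name]))


best = {user: best_movie(profile, movies) for user, profile in users.items()}
print(best)
```

Note that on this data user 3's top pick is movie1 (3 / (sqrt(6) x sqrt(3)), about 0.707), which beats the 0.54772 score of movie3 computed above.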