recommendation-engine collaborative-filtering

Decision between an user-based and item-based filter

I'm working on an algorithm to generate recommendations for a platform where you can review restaurants. So the database exists of 3 tables, 'Users', 'Restaurants' and 'Reviews'. Reviews have a rating of 1-5. Each restaurant can have multiple reviews, and users can have multiple reviews. A review can only have 1 user and 1 restaurant.

First I'll explain my current research/conclusions

I have implemented most of an user-based system, but I'm encountering several issues that come from this way of generating recommendations.

First of all, the data sparsity issue, because there is a possibility that users have a few reviews on very different restaurants, it's very difficult to determine a correlation between users.

Second I have encountered the cold-start problem. Before being able to calculate a correlation between users, you'll need around 4 reviews per review on exactly the same restaurants.
I have found a 'solution' for this by using a content-based filter. I personally don't like this way of filtering when there are better solutions (I.e. Item-based filtering)

And last, the scalability issue. Because (Hopefully) the app will become a huge success, it's going to be very heavy on the performance to calculate these correlations for each user.

Because of these issues I have decided to do some research on an item-based filter. I just would like some feedback to be sure my recommendations will be correct and to be sure I understand how this works fully.

First, you'll have to calculate a similarity correlation between restaurants, based on reviews left on said restaurants.

Example of a matrix (If I'm correct) where R1 = restaurant 1, etc.

   |  R1  |  R2  |  R3  |  R4  |
R1 |  1   | 0.75 | 0.64 | 0.23 |
R2 | 0.75 |  1   | 0.45 | 0.98 |
R3 | 0.64 | 0.45 |  1   | 0.36 |
R4 | 0.23 | 0.98 | 0.36 |  1   |

To generate recommendations for a user, you'll have to create a list of restaurants the user has reviewed positively.
Then check the matrix to see which restaurants are similar to these restaurants.
After this, you will have to check which of these restaurants the user hasn't reviewed yet. These will be recommendations for the user.

Using the item-based filter will solve most issues you'll encounter when using the user-based filter.

Data sparsity. Because you don't rely on similar users anymore, you won't have the problem of 'Not enough reviews'.
Cold start problem. Because you can base your recommendations on 1 review of the user, you won't have the cold start issues (Apart from having null data about the user)
Scalability. You don't have to generate the similarity matrix a lot. You can do this once a day or even week. And for generating the recommendations, you can simply consult the matrix and retrieve restaurants from that.

Now my questions:

I would really like some opinions. I'm not saying any of my research are facts, I'm just wondering if I'm doing things correctly.

Am I doing everything correctly? I have done plenty of research, but since every scenario is different, I find it difficult to determine if I'm doing the right thing.

Is it true that an item-based filter is much better than an user-based one?

How exactly would you calculate a similarity between restaurants? I understand the angle between vectors, but how do you determine the points on the vectors? You can't just take reviews and put them against other reviews, since then all highest rated restaurants will be very similar. How do you set up these vectors?

In my scenario, what would be the best solution, and what similarity coefficient would be best?

Also, what would happen when my review matrix looks like this?

       |  R1  |  R2  | 
User 1 |  ?   |  5   | 
User 2 |  ?   |  3   | 
User 3 |  5   |  ?   |  
User 4 |  3   |  ?   |

Is it possible to calculate a similarity between these two restaurants? And if not, how could I fix that?

Any feedback on this would be greatly appreciated!

Solution

You look like you have the right questions on your mind, I'll try to give you a few pointers and some suggestions:

Cold Start:

You won't be able to solve the cold start problem. Period. You can only mitigate it, an Item to Item approach suffers less of cold start but you still need the restaurants to have some reviews and the users to have at least one.

If you have acess to content information about users and restaurants then take advantage of it to make recommendations for when you don't have enough data.

This leads me to an interesting insight. You don't need to use the same algorithm to make all your recommendations. You can specify different approaches for users with different amounts of data.

For instance you can start by recommending the most popular restaurants in the local area to the user if you have the location, otherwise just use the most popular restaurants in your database.

Item 2 Item vs User 2 User vs Something Else:

I2I and U2U Collaborative filtering are recommendation systems algorithms that have achieved good results, however they suffer from all the problems you mentioned. I2I usually performs better but also suffers from cold start, scalability and the other problems.

There is another class of algorithms that have been outperforming I2I and U2U, and are also based on the same intuition of using the reviews to determine what items to recommend. Matrix Factorization Algorithms try to represent Users and items information with hidden factors and make recommendations based on those factors. You should definetly investigate it further.

Similarity Calculation:

The starting point would definetly be the Cosine Similarity, where each vector representing a restaurant is an array with the reviews of all users to that restaurant, and with 0s when the user hasn't reviewed the restaurant.

Detailed explaination with example here.

Scaling:

Even though I2I scales better it is still quadratic on the number of restaurants for both the memory and computational requirements. So you should investigate other options such as using Local Sensitive Hashing to calculate the similarities. I won't go into much detail as I am not very confortable with that algorithm, but I think you can apply it to calculate the most similar pairs without having to store the entire matrix.

Sources for further investigation:

A good introduction: Chapter 9 of Mining Massive Datasets
The Recommender Systems bible: Recommender Systems Handbook