recommendation-engine, collaborative-filtering

recommendation engine metrics


I have been working on a recommendation system based on implicit feedback, so I am using (user, item, count) tuples to build my user-item matrix.

I implemented my recommendation system following this really nice example on the Insight Data Science blog: http://insightdatascience.com/blog/explicit_matrix_factorization.html

However, compared to the MovieLens dataset, my dataset is incredibly sparse. In the example, 6.3% of the matrix is filled in, while for mine that number is 0.30%. So there are a lot of unknown values in my dataset. I have around 2,900 users and 5,000 items.

I have been training my model, but the training MSE refuses to come down. I have tried optimizing the parameters, but to no avail. I have the following questions:

(1) Is MSE not a reliable metric? I had gone through this discussion: https://www.quora.com/How-do-you-measure-and-evaluate-the-quality-of-recommendation-engines

However, A/B testing is not an option for me. My experience with machine learning models has always taught me that if the training MSE is stuck at a point, that's a pretty bad sign (for a whole bunch of reasons).

So, am I not evaluating things properly?

(2) The cold start problem? I am initializing my user vectors and item vectors like this:

    self.user_vectors = np.random.normal(size=(self.num_users, self.num_factors))
    self.item_vectors = np.random.normal(size=(self.num_items, self.num_factors))

Is there something I can change here?

I am confused as to what to do next. The sparsity of the matrix is very high, and I know my algorithm is predicting values for a large number of the zeros. I intuitively feel that this is keeping my MSE constant.

Any thoughts or direction would be really appreciated!

Thanks


Solution

  • (1) The MovieLens dataset is an academic dataset, and its authors made a deliberate choice in generating it that makes it very different from a real-life recommendation system dataset. In the dataset's README they state:

    Each user has rated at least 20 movies. 
    

    So the low RMSE figures reported on it are only applicable to users with this characteristic.
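    To see how much this filtering matters for your own data, you could apply the same kind of cutoff and check what survives. A minimal sketch with made-up sample tuples (the cutoff value and the data are illustrative):

```python
from collections import Counter

# (user, item, count) tuples -- hypothetical sample data
interactions = [
    ("u1", "i1", 3), ("u1", "i2", 1), ("u2", "i1", 5),
    ("u2", "i3", 2), ("u3", "i2", 1),
]

MIN_INTERACTIONS = 2  # MovieLens effectively uses 20; tune for your data

# Count interactions per user, then keep only users above the cutoff
items_per_user = Counter(u for u, _, _ in interactions)
kept = [t for t in interactions if items_per_user[t[0]] >= MIN_INTERACTIONS]

print(len(kept))  # interactions from users meeting the cutoff
```

    Comparing the sparsity before and after such a filter tells you how comparable your setup really is to MovieLens.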

    I would suggest two metrics:

    1. Split your data into train and test sets, and check how well you predict the (user, movie) pairs that are in your test set;
    2. Compare your predictions with a baseline, such as the average of the user mean and the movie mean.
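    The two points above can be sketched together on a toy matrix (the hold-out scheme and the user/item-mean baseline here are illustrative choices, not the only ones):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy matrix of implicit counts (0 = unobserved)
ratings = np.array([
    [3., 0., 1., 2.],
    [0., 2., 0., 4.],
    [1., 0., 5., 0.],
])

# Hold out one observed entry per user as the test set
train = ratings.copy()
test = np.zeros_like(ratings)
for u in range(ratings.shape[0]):
    i = rng.choice(np.flatnonzero(ratings[u] > 0))
    test[u, i] = ratings[u, i]
    train[u, i] = 0.

def observed_mean(a, axis):
    # Mean over observed (nonzero) entries only; 0 where nothing observed
    s, n = a.sum(axis=axis), (a > 0).sum(axis=axis)
    return np.divide(s, n, out=np.zeros_like(s), where=n > 0)

# Baseline prediction: average of the user mean and the item mean
user_mean = observed_mean(train, axis=1)
item_mean = observed_mean(train, axis=0)
pred = (user_mean[:, None] + item_mean[None, :]) / 2.

mask = test > 0
baseline_mse = float(np.mean((pred[mask] - test[mask]) ** 2))
print(baseline_mse)
```

    If your factorization model can't beat this kind of baseline on held-out entries, the low training MSE plateau is the least of your problems.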

    (2) I think you are a little confused about the cold start problem: it affects recommendation systems (RS) that have no data about a user or an item. For instance, if nobody has seen a movie, you can't make reliable predictions about who will enjoy it. Likewise for users: for someone who hasn't watched any movies, you can't predict which movies they will like.

    One way to overcome the problem is to create a similarity measure between movies and between users, based on their characteristics (gender, age, and country for users; genre, date, and language for movies). With this you can make recommendations based on the most similar users and movies. These types of RS are called hybrid.
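    As a sketch of that feature-based similarity (the feature encoding below is entirely made up for illustration):

```python
import numpy as np

# Hypothetical user feature vectors, e.g. one-hot country + scaled age
user_features = np.array([
    [1., 0., 0.25],   # user 0
    [1., 0., 0.30],   # user 1 -- very similar to user 0
    [0., 1., 0.80],   # user 2
])

def cosine_sim(X):
    # Row-normalize, then the Gram matrix gives pairwise cosine similarity
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.clip(norms, 1e-12, None)
    return Xn @ Xn.T

sim = cosine_sim(user_features)

# Most similar *other* user to user 0 ([-1] would be user 0 itself)
nearest = int(np.argsort(sim[0])[-2])
print(nearest)
```

    A new user with zero interactions can then inherit recommendations from their nearest neighbors in feature space.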

    Suggested papers: