apache-spark pyspark recommendation-engine

Data format for Spark ALS recommendation system with implicit feedback

The ALS module in Spark expects the data as (user, product, rating) tuples. With implicitPrefs=True the ratings are treated as implicit ratings, so a rating of 0 has a special meaning and is not treated as unknown. As described by Hu et al. (2008), ALS uses the implicit ratings as weights. In that formulation, the "missing" ratings enter the algorithm as zeros.

My question is: does the ALS module need the user to provide the "missing" implicit ratings as zeros, or does it automatically populate the missing cells with zeros?

To give an example, say that I have three users, three products and their ratings (using (user, product, rating) format):

(1, 1, 2)
(1, 2, 1)
(2, 2, 3)
(3, 1, 1)
(3, 3, 2)

So user 1 did not rate product 3, user 2 did not rate products 1 or 3, and so on. Can I pass this data directly to ALS? Or do I have to expand it to all 3*3 possible combinations, with the unrated products given a rating of zero, i.e.

(1, 1, 2)
(1, 2, 1)
(1, 3, 0)
(2, 1, 0)
(2, 2, 3)
(2, 3, 0)
(3, 1, 1)
(3, 2, 0)
(3, 3, 2)

Solution

This might not be considered a full answer.

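You don't need to pass the missing ratings, whether the feedback is implicit or explicit. The five observed tuples can be passed to ALS as they are; in implicit mode, every (user, product) pair that is absent from the input is treated as a zero rating automatically.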

One of the strengths of Spark is that it computes the prediction matrix using a sparse matrix representation.
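To see why the sparse form loses no information, here is a plain-Python sketch of the implicit-feedback formulation from Hu et al. (2008): preference p = 1 if r > 0 else 0, and confidence c = 1 + alpha * r. The function name `implicit_cells` is hypothetical and this is not Spark's implementation, which avoids building the dense grid; the sketch only shows that a missing cell and an explicit zero yield the same (preference, confidence) pair.

```python
def implicit_cells(observed, users, items, alpha=1.0):
    """Return {(user, item): (preference, confidence)} for every cell.

    Hypothetical helper for illustration only.
    """
    # Sparse storage: only the observed cells are kept.
    ratings = {(u, i): r for u, i, r in observed}
    cells = {}
    for u in users:
        for i in items:
            r = ratings.get((u, i), 0.0)        # a missing cell behaves like r = 0
            preference = 1.0 if r > 0 else 0.0  # p = 1 if r > 0 else 0
            confidence = 1.0 + alpha * r        # c = 1 + alpha * r
            cells[(u, i)] = (preference, confidence)
    return cells


# The five observed tuples from the question.
observed = [(1, 1, 2.0), (1, 2, 1.0), (2, 2, 3.0), (3, 1, 1.0), (3, 3, 2.0)]
# The same data with the four unobserved pairs written out as zeros.
expanded = observed + [(1, 3, 0.0), (2, 1, 0.0), (2, 3, 0.0), (3, 2, 0.0)]

sparse_cells = implicit_cells(observed, users=(1, 2, 3), items=(1, 2, 3))
dense_cells = implicit_cells(expanded, users=(1, 2, 3), items=(1, 2, 3))
print(sparse_cells == dense_cells)
```

Under these assumptions the zero-filled rows add nothing: both inputs describe the same nine cells, and the sparse one stores only five of them.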

If you wish to know a little more about sparse matrices, you can check the following link:

What are sparse matrices used for? What is its application in machine learning?

Disclaimer: I'm the author of the answer in that link.