
Recommendation system with a large amount of data


I'm implementing a movie recommendation system with real user data. I planned to use collaborative filtering, but methods of this kind usually involve a huge matrix storing each user's rated movies. Since I have more than ten thousand movies and a hundred thousand users, it seems impossible for me to create such a huge sparse matrix. How does everyone implement collaborative filtering with this much data? Thanks!
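Putting the scale in numbers helps: a dense 100,000 × 10,000 matrix would be a billion cells, but if each user rates on the order of 100 movies, only about 10 million ratings actually exist. A sparse map-of-maps that stores only observed ratings fits comfortably in memory on one machine. A minimal sketch in Java (class and method names are illustrative, not from any library):

```java
import java.util.HashMap;
import java.util.Map;

/** Sparse user-item rating store: only observed ratings are kept. */
class SparseRatings {
    // userId -> (movieId -> rating); an absent entry means "not rated"
    private final Map<Long, Map<Long, Float>> byUser = new HashMap<>();

    void rate(long userId, long movieId, float rating) {
        byUser.computeIfAbsent(userId, u -> new HashMap<>()).put(movieId, rating);
    }

    /** Returns the rating, or null if this user never rated this movie. */
    Float get(long userId, long movieId) {
        Map<Long, Float> row = byUser.get(userId);
        return row == null ? null : row.get(movieId);
    }

    /** Total number of stored ratings (not users x movies). */
    long size() {
        return byUser.values().stream().mapToLong(Map::size).sum();
    }
}
```

At a few dozen bytes per stored rating, ~10M ratings is on the order of hundreds of MB, which is manageable with a reasonably sized heap.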


Solution

  • I would point you to distributed computing frameworks, but I think this is still a scale you can easily handle on one machine.

    Apache Mahout contains the Taste collaborative filtering library, which was designed to scale on one machine. A model of -- what, 10M data points? -- should fit in memory with a healthy heap size. Look at things like GenericItemBasedRecommender and FileDataModel.

    (Mahout also has distributed implementations based on Hadoop, but I don't think you need this yet.)

    I'm the author of that, but have since moved on to commercializing large-scale recommenders as Myrrix. It contains a stand-alone single-machine version, which is free and open source, and it will also easily handle this amount of data on one machine. For example, this is a smaller data set than what's used in this example. Myrrix also has a distributed implementation.

    There are other fast distributed implementations beyond the above, like GraphLab. Other non-distributed frameworks are also probably fast enough, like MyMediaLite.

    I would suggest just using one of these, or, if you really are wondering "how" it happens, dig into the source code and look at the data representation.
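    If you do peek at the internals, the core of an item-based recommender like the one above is just computing similarities between the sparse rating vectors of two items. A toy sketch of that idea (my own illustrative code, not Mahout's actual implementation), using cosine similarity over item rows of the sparse matrix:

    ```java
    import java.util.HashMap;
    import java.util.Map;

    /** Toy item-based CF core: cosine similarity between sparse item vectors. */
    class ItemCF {
        // movieId -> (userId -> rating): the item rows of the sparse matrix
        private final Map<Long, Map<Long, Float>> byItem = new HashMap<>();

        void rate(long userId, long movieId, float rating) {
            byItem.computeIfAbsent(movieId, m -> new HashMap<>()).put(userId, rating);
        }

        /** Cosine similarity of two items' rating vectors; 0 if either is unrated. */
        double similarity(long itemA, long itemB) {
            Map<Long, Float> a = byItem.getOrDefault(itemA, Map.of());
            Map<Long, Float> b = byItem.getOrDefault(itemB, Map.of());
            double dot = 0, normA = 0, normB = 0;
            for (float v : a.values()) normA += v * v;
            for (float v : b.values()) normB += v * v;
            // Only users who rated both items contribute to the dot product
            for (Map.Entry<Long, Float> e : a.entrySet()) {
                Float other = b.get(e.getKey());
                if (other != null) dot += e.getValue() * other;
            }
            return (normA == 0 || normB == 0) ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
        }
    }
    ```

    Because both vectors are sparse maps, each similarity is computed in time proportional to the number of actual ratings, never the full user count, which is why this stays cheap even with a hundred thousand users.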