apache-spark scikit-learn mahout recommendation-engine mahout-recommender

Content-based recommendation at scale


This question has probably been asked many times on blogs and Q&A sites, but I couldn't find a concrete answer.

I am trying to build a recommendation system for customers using only their purchase history.

  • Let's say my application has n products.
  • Compute item similarities for all the n products based on their attributes (like country, type, price)
  • When a user needs a recommendation, loop over the products p previously purchased by user u and fetch the products similar to each one (the similarities were computed in the previous step)
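The three steps above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the catalogue `PRODUCTS`, the one-hot plus min-max encoding, and the choice of cosine similarity are all hypothetical and not taken from any particular library.

```python
import math

# Hypothetical catalogue: attribute names follow the question,
# the values are made up for illustration.
PRODUCTS = {
    "p1": {"country": "FR", "type": "wine", "price": 12.0},
    "p2": {"country": "FR", "type": "wine", "price": 15.0},
    "p3": {"country": "IT", "type": "wine", "price": 14.0},
    "p4": {"country": "US", "type": "beer", "price": 3.0},
}

def encode(products):
    """One-hot encode country and type, min-max scale price to [0, 1]."""
    countries = sorted({p["country"] for p in products.values()})
    types = sorted({p["type"] for p in products.values()})
    prices = [p["price"] for p in products.values()]
    lo, hi = min(prices), max(prices)
    span = (hi - lo) or 1.0  # avoid division by zero if all prices are equal
    vectors = {}
    for pid, p in products.items():
        vec = [1.0 if p["country"] == c else 0.0 for c in countries]
        vec += [1.0 if p["type"] == t else 0.0 for t in types]
        vec.append((p["price"] - lo) / span)
        vectors[pid] = vec
    return vectors

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def similarity_table(products):
    """Step 2: similarity of every item to every other item."""
    vectors = encode(products)
    return {
        a: {b: cosine(vectors[a], vectors[b]) for b in vectors if b != a}
        for a in vectors
    }

def recommend(purchased, table, n=2):
    """Step 3: sum the similarities to each purchased item, skip owned items."""
    scores = {}
    for p in purchased:
        for other, sim in table[p].items():
            if other not in purchased:
                scores[other] = scores.get(other, 0.0) + sim
    return sorted(scores, key=scores.get, reverse=True)[:n]

table = similarity_table(PRODUCTS)
print(recommend(["p1"], table))
```

The all-pairs table is quadratic in n, so for a large catalogue it would have to be computed offline or in a distributed job; the sketch only shows the logic.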

If I am right, this is called content-based recommendation, as opposed to collaborative filtering, since it doesn't involve item co-occurrence or user preferences for an item.

My problem is multi-fold:

  1. Is there an existing scalable ML platform that addresses content-based recommendation? (I am open to adopting a different technology or language.)
  2. Is there a way to tweak Mahout to get this result?
  3. Is classification a way to handle content-based recommendation?
  4. Is this something that a graph database is good at solving?

Note: I looked at Mahout for doing this at scale, since I am familiar with Java and Mahout apparently uses Hadoop for distributed processing, and it has the advantage of offering well-tested ML algorithms.

Your help is appreciated. Any examples would be really great. Thanks.


Solution

  • The so-called item-item recommenders are natural candidates for precomputing the similarities, because the attributes of the items rarely change. I would suggest you precompute the similarity between each pair of items and perhaps store only the top K for each item. If you have enough resources, you could load the similarity matrix into main memory for real-time recommendation.

    Check out my answer to this question for a way to do this in Mahout: Does Mahout provide a way to determine similarity between content (for content-based recommendations)?

    The example shows how to compute the textual similarity between the items and then load the precomputed values into main memory.

    For a performance comparison of different data structures for holding the values, check out this question: Mahout precomputed Item-item similarity - slow recommendation
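To illustrate the "store the top K for each item" idea outside Mahout, here is a minimal plain-Python sketch. It is not the Mahout code from the linked answers: the item data, the Jaccard similarity, and the function names are hypothetical, chosen only to show how a bounded heap keeps memory at roughly n × K entries instead of n × n.

```python
import heapq
from itertools import combinations

# Hypothetical items described by sets of attribute values.
ITEMS = {
    "a": {"FR", "wine"},
    "b": {"FR", "wine", "red"},
    "c": {"IT", "wine"},
    "d": {"US", "beer"},
}

def jaccard(x, y):
    """Jaccard similarity of the attribute sets of items x and y."""
    sx, sy = ITEMS[x], ITEMS[y]
    union = sx | sy
    return len(sx & sy) / len(union) if union else 0.0

def top_k_neighbours(item_ids, similarity, k):
    """Keep only the k most similar neighbours per item.

    Each item owns a min-heap of at most k (score, neighbour) pairs,
    so the weakest neighbour is evicted as soon as a better one arrives.
    """
    heaps = {i: [] for i in item_ids}
    for a, b in combinations(item_ids, 2):
        s = similarity(a, b)  # assumed symmetric, so computed once per pair
        for item, other in ((a, b), (b, a)):
            heap = heaps[item]
            if len(heap) < k:
                heapq.heappush(heap, (s, other))
            else:
                heapq.heappushpop(heap, (s, other))
    # Best neighbour first, ready for lookup at request time.
    return {i: sorted(h, reverse=True) for i, h in heaps.items()}

neighbours = top_k_neighbours(sorted(ITEMS), jaccard, k=2)
print(neighbours["a"])
```

At request time a recommendation is then just a dictionary lookup per purchased item, which is why the table can reasonably be held in main memory when K is small.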