I am in the process of evaluating Mahout as a collaborative-filtering recommendation engine. So far it looks great. We have almost 20M boolean preferences from 12M different users. According to Mahout's wiki and a few threads by Sean Owen, one machine should be sufficient in this case. Because of that, I decided to go with MySQL as the data model and skip the overhead of using Hadoop for now.
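For reference, here is roughly how I am wiring it up. This is only a sketch: the connection details, the table and column names, and the choice of an item-based recommender with Tanimoto similarity are placeholders I picked, not settled decisions.

    import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;
    import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLBooleanPrefJDBCDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
    import org.apache.mahout.cf.taste.model.JDBCDataModel;
    import org.apache.mahout.cf.taste.recommender.ItemBasedRecommender;

    public class RecommenderSetup {
        public static void main(String[] args) throws Exception {
            // Plain Connector/J DataSource; in production this would be pooled
            MysqlDataSource dataSource = new MysqlDataSource();
            dataSource.setServerName("localhost");
            dataSource.setDatabaseName("recommender");
            dataSource.setUser("mahout");
            dataSource.setPassword("secret");

            // Boolean (presence-only) preferences; table/column names are ours
            JDBCDataModel model = new MySQLBooleanPrefJDBCDataModel(
                dataSource, "taste_preferences", "user_id", "item_id", "timestamp");

            // Item-based CF with a similarity metric suited to boolean data
            ItemBasedRecommender recommender = new GenericItemBasedRecommender(
                model, new TanimotoCoefficientSimilarity(model));

            // e.g. top 10 items for user 12345
            System.out.println(recommender.recommend(12345L, 10));
        }
    }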
One thing eludes me, though: what are the best practices for continuously updating the recommendations without reading the whole data set from scratch? We get tens of thousands of new preferences every day. While I do not expect them to be processed in real time, I would like them processed every 15 minutes or so.
Please elaborate on the approaches for both a MySQL-based and a Hadoop-based deployment. Thanks!
Any database is too slow to query in real time, so any approach involves caching the data set in memory, which is what I assume you're already doing with ReloadFromJDBCDataModel. Just call refresh() to have it reload at whatever interval you like; it should do so in the background. The catch is that it will need a lot of memory to load the new model while still serving from the old one. You could roll your own solution that, say, reloads one user at a time.
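Concretely, something along these lines (a sketch: the executor and the 15-minute interval are just one way to drive the reload, and the helper method is mine, not a Mahout API):

    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.impl.model.jdbc.ReloadFromJDBCDataModel;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.model.JDBCDataModel;

    // Wrap the MySQL-backed model so queries are served from memory
    static DataModel inMemory(JDBCDataModel jdbcModel) throws TasteException {
        final ReloadFromJDBCDataModel model = new ReloadFromJDBCDataModel(jdbcModel);

        // Re-pull the data set every 15 minutes. The new model is built
        // while requests keep being served from the old one, which is why
        // you briefly need memory for both copies.
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(new Runnable() {
            public void run() {
                model.refresh(null); // null = refresh everything
            }
        }, 15, 15, TimeUnit.MINUTES);

        return model;
    }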
There's no such thing as real-time updates on Hadoop. Your best bet there is generally to use Hadoop for the full, proper batch computation of results, and then tweak them at run time (imperfectly) based on new data, in the app that holds and serves the recommendations.
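To make that run-time tweak concrete, here's a hypothetical sketch. Everything below is invented for illustration: the class, both maps, and where they'd be loaded from. In practice the batch side would be something like the output of Mahout's RecommenderJob on Hadoop, pushed into whatever store your app reads.

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    class RecommendationServer {
        private final Map<Long, List<Long>> batchRecs;     // userID -> ranked itemIDs from the last Hadoop run
        private final Map<Long, Set<Long>> recentActivity; // userID -> items interacted with since that run

        RecommendationServer(Map<Long, List<Long>> batchRecs,
                             Map<Long, Set<Long>> recentActivity) {
            this.batchRecs = batchRecs;
            this.recentActivity = recentActivity;
        }

        List<Long> recommend(long userID, int howMany) {
            Set<Long> seen = recentActivity.getOrDefault(userID, Collections.emptySet());
            List<Long> result = new ArrayList<>(howMany);
            // Cheapest imperfect tweak: drop items the user has already
            // touched since the batch ran, instead of recomputing anything
            for (long itemID : batchRecs.getOrDefault(userID, Collections.emptyList())) {
                if (!seen.contains(itemID)) {
                    result.add(itemID);
                    if (result.size() == howMany) {
                        break;
                    }
                }
            }
            return result;
        }
    }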