Search code examples
hadoopmachine-learningbigdatamahoutmahout-recommender

Need clarification about usage of mahout with hadoop


I currently have an implementation of a recommender in mahout using the in memory recommendation apis. However, I would like to move to a distributed solution using hadoop in order to calculate offline recommendations. This is my first time using hadoop and I'm looking for clarification on a few concepts and api usages.

Currently, my understanding of hadoop is minimal and I think that the correct approach is the following:

  • use something like apache drill in order to populate the hdfs with the user and item data.

  • using the recommendation job in mahout train on the data from the hdfs.

  • transform the resultant data in the hdfs to index shards to be used by solr

  • use solr to provide the recommendations to the userbase

However, I am looking for clarifications on a couple aspects of this design:

  1. How would I utilize a rescorer in the manner that it is used in the in memory live recommendations?

  2. What is the best manner in which to invoke the recommendations job?

I have other questions besides these two but the answers to these would be a huge help.


Solution

  • You may be talking about the Mahout + Hadoop + Solr recommender. This method handles rescoring in a couple different ways.

    The basic recommender can be put together in two ways:

    1. After getting data into into HDFS in the form of (user id, item id, preference weight) run the ItemSimilarityJob on the data (use LLR similarity, which is usaully best). It will create what is called an indicator matrix. This will be an item id by item id sparse matrix of values indicating the similarity magnitude between any two items. You must then convert this into values that Solr can index. That means translating the internal Mahout integer IDs into some unique string representation, which is probably what they were at the very beginning. This will look like (item123,item223 item643 item293 item445...) as a CSV. so two Solr fields, the first is an item id, the second is a list of similar items. All ids must be text tokens. Then the query for recommendations is a Solr query made up of item ids that a particular user has shown a preference for. So query = "item223 item344 item445...". Make the query against the filed that olds the indicator matrix values. You will get back an ordered list of item IDs
    2. A much easier way that may work for you is to use a tool in the /examples folder of Mahout 1.0-SNAPSHOT or here: https://github.com/pferrel/solr-recommender. It takes in raw log files with unique strings for user and item ids. it does all the work on Hadoop to output CSVs that can be indexed by Solr directly or loaded into a DB as described above.

    The way I did the demo site (https://guide.finderbots.com) is to use my Solr web app integration, putting the indicator matrix into a DB attaching the similar item list to my collection of items. So item123 got item223 item643 item293 item445... in its indicator field. After you index the collection the query is then = "item223 item344 item445..." -- the user's prefered items.

    Here are three ways to do rescoring:

    1. Mix in metadata with the query. So you could do query = "item223 item344 item445..." against the indicator field AND "SciFi" against the "genre" field. This gives you blended collaborative filtering and metadata in your query and as you can imagine, the recs are based on the user's prefs but skewed towards "SciFi". There are lots of other interesting things you can do once you get item+indicators+metadata into an index.
    2. Filter recs by metadata. You can get recs not skewed but filtered, if you want. Using the Solr query = "item223 item344 item445..." against the indicator field AND "SciFi" as a filter against the "genre" field. In this case you get nothing but "SciFi" where #1 you would get mostly "SciFi"
    3. Get your ordered list of recs back and rescore them in any way you'd like based on other things you know about the user, context, or items. Often these can be encoded into a Solr query and done with one query but reordering and filtering can be done after the recs are returned too. You would have to write that code, it is not built in.

    The fun thing is you can mix filters, metadata fields, and user preferences with what Solr calls "boost" values to get all sorts of rescoring. Solr can even use location to query, skew, or filter.

    Note: You don't have to worry about Solr shards necessarily. Solr will index most DBs and HDFS directly but only the index is sharded. You shard an index if you have a very big one, you replicate it if you have lots of queries/second (or for failover). Solr queries are generally very fast so I'd worry about that after you have a functioning system since it's a config thing and shouldn't be affected by the rest of your workflow.