Tags: r, hadoop, amazon-simpledb, mahout, google-bigquery

Recommendations using R with SimpleDB or BigQuery or using PHP with SimpleDB


I am currently working on a system that generates product recommendations like those on Amazon: "People who bought this also bought this..."

Current Scenario:

  • Extract the client's Google Analytics data and insert it into a database.

  • On the client's website, when a product page loads, an API call is made to get recommendations for the product being viewed.

  • When the API receives a product ID in a request, it looks in the database, retrieves the recommended product IDs (using association rules), and sends them as the response.

  • The list of product IDs is then processed at the client end to get the product details (image, price, ...) and displayed on the website.

  • Currently I am using PHP and MySQL with the gapi package and REST API storage on Amazon EC2.
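
For illustration, the association-rule lookup in the scenario above could be sketched roughly as follows. This is a minimal Python sketch, not the actual PHP/MySQL implementation: the function names `build_rules` and `recommend`, the thresholds, and the sample baskets are all assumptions made up for the example, and it only mines pairwise rules (A -> B) by support count and confidence.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_rules(baskets, min_support=2, min_confidence=0.5):
    """Mine pairwise rules lhs -> rhs from baskets (sets of product IDs).

    Hypothetical helper: support is a raw co-occurrence count and
    confidence is count(lhs and rhs) / count(lhs).
    """
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))

    rules = defaultdict(list)
    for (a, b), n_ab in pair_counts.items():
        if n_ab < min_support:
            continue
        # Each frequent pair can yield a rule in both directions.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = n_ab / item_counts[lhs]
            if confidence >= min_confidence:
                rules[lhs].append((rhs, confidence))

    # Highest confidence first; ties broken by product ID for stable output.
    for lhs in rules:
        rules[lhs].sort(key=lambda rule: (-rule[1], rule[0]))
    return dict(rules)


def recommend(rules, product_id, k=3):
    """Return up to k recommended product IDs for the product being viewed."""
    return [rhs for rhs, _ in rules.get(product_id, [])[:k]]


# Made-up sample data: each set is one order.
baskets = [
    {"p1", "p2", "p3"},
    {"p1", "p2"},
    {"p1", "p3"},
    {"p2", "p4"},
]
rules = build_rules(baskets)
print(recommend(rules, "p1"))
```

With the sample baskets, a request for "p1" returns "p2" and "p3", while a product with no qualifying rule (such as "p4") returns an empty list.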

My question is: if I have to choose among the following, which would be the best choice for implementing the concept described above?

  • PHP with SimpleDB or BigQuery.

  • R with BigQuery.

  • RHIPE (R and Hadoop) with SimpleDB.

  • Apache Mahout.

Please help!


Solution

  • This isn't so easy to answer, because the constraints are fairly specialized.

    The following considerations can be made, though:

    1. BigQuery is not yet public. With such a small user base, even if you are in the preview population, it will be harder to get advice on improvements.
    2. Each of your options involves a modeling system and a storage system. Apache Mahout is not a storage mechanism, so it won't necessarily work on its own. I used to believe that its machine learning implementations were a pastiche of a few Google Summer of Code projects, but I've updated that view on the suggestion of a commenter. It still looks like it has rather uneven and spotty coverage of different algorithms, and it's not particularly clear how the components are supported or maintained. I encourage an evangelist for Mahout to address this.

    Together, these points eliminate the 1st, 2nd, and 4th options, which leaves RHIPE with SimpleDB.

    What I don't quite get is why a real-time server would need Hadoop and RHIPE. They belong in the batch processing that develops the recommendation models, not in the real-time path. I suppose you could use RHIPE as a simple one-stop front end for firing off queries.
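
To make the batch versus real-time split concrete, here is a rough Python sketch. It assumes the batch job has already produced a mapping from product ID to recommended IDs, and it uses an in-memory SQLite table purely as a stand-in for whatever store is chosen (SimpleDB, MySQL, and so on). The table name `recs` and the functions `batch_publish` and `serve` are hypothetical.

```python
import json
import sqlite3


def batch_publish(conn, recommendations):
    """Batch side: write precomputed recommendations, one row per product."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recs ("
        "product_id TEXT PRIMARY KEY, rec_ids TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO recs (product_id, rec_ids) VALUES (?, ?)",
        [(pid, json.dumps(ids)) for pid, ids in recommendations.items()],
    )
    conn.commit()


def serve(conn, product_id):
    """Real-time side: a single key lookup, with no model fitting involved."""
    row = conn.execute(
        "SELECT rec_ids FROM recs WHERE product_id = ?", (product_id,)
    ).fetchone()
    return json.loads(row[0]) if row else []


# Stand-in store; the recommendations here are made-up sample output
# of an offline model-fitting job.
conn = sqlite3.connect(":memory:")
batch_publish(conn, {"p1": ["p2", "p3"], "p3": ["p1"]})
print(serve(conn, "p1"))
```

The point of the design is that the heavy computation runs offline on whatever schedule suits the data, and the request path only reads a precomputed row.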

    I'd recommend using RApache instead of RHIPE, because you can get your packages and models pre-loaded. I see no advantage to using Hadoop in the front end, but it would be a very natural back end system for the model fitting.

    (Update 1) Other interface options include Rserve (http://www.rforge.net/Rserve/) and possibly RStudio in server mode. There are R/PHP interfaces (see comments below), but I suspect it would be better to access R through HTTP or TCP/IP.

    (Update 2) Addressing the whole process, the basic idea I see is that you could query the data from PHP and pass it to R. If you wish to query from within R, look at the link in the comments (to the OmegaHat tools) or post a new question about R & SimpleDB; I'm sure someone else on SO would be able to give better insight on this particular connection. RApache would let you instantiate many R processes that are already prepared, with packages loaded and data in RAM, so you would only need to pass whatever data is needed for the prediction. If your new data is a small vector, then RApache should be fine, and that seems to be the case for the data being processed in real time.
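
The pattern of keeping a model resident in memory and sending it only a small vector over HTTP can be sketched as follows. This is a Python analogue using the standard library, not RApache or Rserve themselves: the weights, the `/score` path, the JSON payload shape, and the names `ScoreHandler` and `score_remote` are all assumptions made up for the illustration.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Loaded once at process start, standing in for a fitted model held in RAM.
# The weights are made-up illustrative values.
MODEL_WEIGHTS = [0.5, 1.5, -1.0]


class ScoreHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # The request carries only the small feature vector to score.
        length = int(self.headers.get("Content-Length", 0))
        features = json.loads(self.rfile.read(length))["features"]
        score = sum(w * x for w, x in zip(MODEL_WEIGHTS, features))
        body = json.dumps({"score": score}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the example quiet


def score_remote(url, features):
    """Client side: post a feature vector and read back the score."""
    payload = json.dumps({"features": features}).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    # Ignore any proxy settings, since this demo talks to localhost only.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())["score"]


# Port 0 asks the OS for any free port; the server runs in a background thread.
server = HTTPServer(("127.0.0.1", 0), ScoreHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/score"
result = score_remote(url, [2.0, 1.0, 1.0])
server.shutdown()
server.server_close()
print(result)
```

In the setup described above, the PHP front end would play the role of `score_remote`, and the long-lived R process would play the role of the handler.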