Tags: java, hadoop, solr, architecture, mahout

Architecture: Data Persistence, Search and Recommendation System


[Architecture diagram]

I am planning a project that involves data persistence, search capabilities, and a recommendation feature (collaborative filtering).

As shown in the diagram, I am thinking of :

1) Having a set of micro-services to handle entities, which will be persisted in NoSQL storage (probably MongoDB).

2) For the search function I will use Solr; messages coming from the micro-services will be used to update the Solr index (a rough sketch of such an index-update consumer is shown after this list).

3) For recommendations, I am thinking of using Apache Mahout, with a message queue to update the Solr index used by Mahout.
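To make point 2 concrete, here is a rough sketch of the index-update consumer I have in mind, using SolrJ. The Solr URL, collection name, field names, and the shape of the event are just placeholders for illustration:

```scala
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument

// Hypothetical event published by a micro-service onto the message queue.
case class EntityChanged(id: String, title: String, description: String)

object SolrIndexUpdater {
  // The "entities" collection and the Solr URL are assumptions for this sketch.
  private val solr = new HttpSolrClient.Builder("http://localhost:8983/solr/entities").build()

  // Called by the queue consumer for each message: maps the event to a
  // SolrInputDocument and (re)indexes it.
  def onMessage(event: EntityChanged): Unit = {
    val doc = new SolrInputDocument()
    doc.addField("id", event.id)
    doc.addField("title", event.title)
    doc.addField("description", event.description)
    solr.add(doc)
    solr.commit() // a real consumer would batch updates or rely on soft commits
  }
}
```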

My questions are :

1) Is this the correct architecture for this kind of problem?

2) Does it need three data stores: MongoDB for data persistence, Solr (Lucene index) for search, and another Solr (Lucene index) used by Mahout for recommendations?

3) Since Solr is also a NoSQL solution, what are the drawbacks of using Solr for both persistence and search, without MongoDB?

4) If I want to use Hadoop or Apache Spark for analytics, does that involve introducing another data store?


Solution

  • This architecture seems reasonable. You can use the same Solr cluster for normal search as well as the recommender search. If you want to write your own data input to Spark, you might implement a method to instantiate the Mahout IndexedDataset from MongoDB. There is already a companion object for taking a PairRDD of (String, String) as a single event's input and creating an IndexedDataset. This would remove the need for HDFS.
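    A minimal sketch of that idea, assuming the IndexedDatasetSpark companion object from the Mahout Spark bindings (its exact signature varies a bit between Mahout versions), with parallelized sample pairs standing in for the (user-id, item-id) pairs you would actually read out of MongoDB:

    ```scala
    import org.apache.spark.SparkContext
    import org.apache.mahout.sparkbindings.indexeddataset.IndexedDatasetSpark

    // Build an IndexedDataset from (user-id, item-id) String pairs.
    // The sample data below stands in for whatever is read from MongoDB.
    def buildIndexedDataset(implicit sc: SparkContext): IndexedDatasetSpark = {
      val interactions = sc.parallelize(Seq(
        ("user1", "itemA"),
        ("user1", "itemB"),
        ("user2", "itemA")))
      // The companion object turns a PairRDD of (String, String) into an IndexedDataset.
      IndexedDatasetSpark(interactions)
    }
    ```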

    Spark saves temp files but does not require HDFS for storage. If you are using AWS, you could put the Spark retraining work onto EMR, spinning a cluster up for training and tearing it down afterwards.

    So the answers are:

    1. Yes, it looks reasonable. You should always keep the event stream in some safe storage.

    2. No, only MongoDB and Solr are needed, as long as you can read from MongoDB into Spark. This would be done in the recommender training code using Mahout's Spark code for SimilarityAnalysis.cooccurrence (see the sketch after this list).

    3. No known downside; I am not sure of the performance or devops trade-offs.

    4. You must use Spark for SimilarityAnalysis.cooccurrence from Mahout, since it implements the new "Correlated Cross-Occurrence" (CCO) algorithm, which will greatly improve your ability to use different forms of user data and in turn increase the quality of recommendations. Spark does not require HDFS storage if you feed in events from MongoDB or Solr.
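    As referenced in answer 2, here is a hedged sketch of the training step, using the IndexedDataset-based entry point to Mahout's cooccurrence code (cooccurrencesIDSs). Parameter names and defaults may differ slightly between Mahout versions, and writing the results back to Solr is left out:

    ```scala
    import org.apache.mahout.math.cf.SimilarityAnalysis
    import org.apache.mahout.math.indexeddataset.IndexedDataset

    // primaryActions: e.g. purchases; secondaryActions: e.g. detail views (the CCO part).
    // Both are IndexedDatasets built from (user-id, item-id) pairs as sketched above.
    def train(primaryActions: IndexedDataset, secondaryActions: IndexedDataset): List[IndexedDataset] = {
      SimilarityAnalysis.cooccurrencesIDSs(
        Array(primaryActions, secondaryActions),
        randomSeed = 1234,                // any fixed seed
        maxInterestingItemsPerThing = 20, // how many "similar" items to keep per item
        maxNumInteractions = 500)         // downsampling cap per user
      // Each returned IndexedDataset maps an item to its list of similar items;
      // those rows are what you would index into Solr for recommender queries.
    }
    ```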

    BTW: ActionML helps with the data-science part of this; we can help you determine which user information is most predictive. We created the first open source implementation of CCO, and we have seen very large increases in recommendation quality from including the right CCO data (much greater than the Netflix Prize 10%). We also support the PredictionIO implementation of the architecture above. We wrote the Universal Recommender based on Mahout (I'm a Mahout committer); it is much more turnkey than building the system from scratch, but our analysis help is independent of the implementation and might help you with the data-science part of the project. ActionML.com, Universal Recommender here. All of it is free OSS.