I am a little bit confused with Mahout: I have the impression that there are two ways to use it, either through its Java API from my own .jar, or from the command line:
mahout recommenditembased --input input/recommend_data.csv --output output/recommendation --similarityClassname SIMILARITY_PEARSON_CORRELATION
as shown here. Is that correct, or are these two methods the same thing?
My problem is: I have a CSV input file with the format user_id, item_id, rating. It has 100,000 lines, and I need to compute recommendations daily for all my users. I've read that this should be OK without Hadoop, but it isn't: the .jar I have created works for small batches but not for the entire input file.
The command-line method runs in 5 minutes, which is fine, but it's not as flexible as the .jar project (especially for the interface with the MySQL database).
Is it possible to use a .jar and still benefit from Hadoop? And since I am not distributing any computation (Hadoop runs on a single server), is it normal to see such a difference between the .jar-without-Hadoop method and the command-line-with-Hadoop method?
Many thanks for your help!
100,000 lines is not a lot of data. I believe you don't need the distributed version of the recommendation algorithm, especially since it would only be running in pseudo-distributed mode (a single machine) anyway.
You could easily use the API to build your own non-distributed recommender. Here's an example from the Mahout in Action book (which I recommend reading) [link]. In that example they use a recommender based on similar users, while from what I see in your question you are using one based on similar items.
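In case the link doesn't come through, here is a minimal sketch along the lines of the book's user-based example. The CSV path and the neighborhood size of 10 are assumptions you'd adjust to your setup; FileDataModel reads exactly the user_id,item_id,rating format you described:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class UserBasedExample {
        public static void main(String[] args) throws Exception {
            // Reads the same user_id,item_id,rating CSV you pass on the command line
            DataModel model = new FileDataModel(new File("data/recommend_data.csv")); // assumed path
            UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // Score items using the 10 most similar users; tune this for your data
            UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
            Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
            // Top 5 recommendations for the user with ID 1
            List<RecommendedItem> recommendations = recommender.recommend(1, 5);
            for (RecommendedItem item : recommendations) {
                System.out.println(item.getItemID() + " : " + item.getValue());
            }
        }
    }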
To make an item-based one, you would use an ItemSimilarity instead of the UserSimilarity. Likewise, instead of a GenericUserBasedRecommender you would use a GenericItemBasedRecommender. Then, of course, you would iterate through all the users and ask for the recommendations for each one of them, as in the sketch below.
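A sketch of that item-based variant for your daily batch, with the same assumed CSV path. PearsonCorrelationSimilarity implements both the user and the item similarity interfaces, so it corresponds to the SIMILARITY_PEARSON_CORRELATION you pass on the command line:

    import java.io.File;
    import java.util.List;

    import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class ItemBasedExample {
        public static void main(String[] args) throws Exception {
            DataModel model = new FileDataModel(new File("data/recommend_data.csv")); // assumed path
            // PearsonCorrelationSimilarity also implements ItemSimilarity
            ItemSimilarity similarity = new PearsonCorrelationSimilarity(model);
            // Item-based recommenders don't need a neighborhood
            GenericItemBasedRecommender recommender = new GenericItemBasedRecommender(model, similarity);
            // Daily batch: iterate over every user and keep their top 10
            // (you would write these to MySQL instead of stdout)
            LongPrimitiveIterator userIDs = model.getUserIDs();
            while (userIDs.hasNext()) {
                long userID = userIDs.nextLong();
                List<RecommendedItem> recs = recommender.recommend(userID, 10);
                for (RecommendedItem item : recs) {
                    System.out.println(userID + "," + item.getItemID() + "," + item.getValue());
                }
            }
        }
    }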
Hope this helps.