Background:
I'm a software engineering student and I was checking out several algorithms for recommendation systems. One of these algorithms, collaborative filtering, has a lot of loops in it: it has to go through all of the users, and for each user, all of the ratings they have made on movies or other rateable items.
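Roughly, the loops look like this (a simplified in-memory sketch; the Hash layout and cosine similarity are just for illustration):

```ruby
# User-based collaborative filtering over { user_id => { item_id => rating } }.
def similarity(a, b)
  shared = a.keys & b.keys
  return 0.0 if shared.empty?
  dot    = shared.sum { |item| a[item] * b[item] }
  norm_a = Math.sqrt(a.values.sum { |v| v**2 })
  norm_b = Math.sqrt(b.values.sum { |v| v**2 })
  dot / (norm_a * norm_b)
end

def recommendations_for(user_id, ratings)
  mine   = ratings[user_id]
  scores = Hash.new(0.0)
  ratings.each do |other_id, theirs|   # loop over all users...
    next if other_id == user_id
    sim = similarity(mine, theirs)
    theirs.each do |item, rating|      # ...and each of their ratings
      scores[item] += sim * rating unless mine.key?(item)
    end
  end
  scores.sort_by { |_item, score| -score }
end
```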
I was thinking of implementing it in Ruby for a Rails app.
The point is, there is a lot of data to be processed, so:
Should this be done in the database, using regular queries? Using PL/SQL or something similar? (Testing databases is extremely time consuming and hard, especially for this kind of algorithm.)
Should I do a background job that caches the results of the algorithm? (If so, the data is processed in memory, and if there are millions of users, how well does this scale?)
Should I run the algorithm every time there is a request, or every x requests? (Again, the data is processed in memory.)
The Question:
I know there are things that do this, like Apache Mahout, but they rely on Hadoop for scaling. Is there another way out? Is there a Mahout or machine learning equivalent for Ruby, and if so, where does the computation take place?
Here are my thoughts on each of the methods:
1. No, it should not. Some calculations would be much faster to run in your database and some would not. However, it would be hard and time consuming to test exactly which calculations should be run in your DB, and you would probably find that some part of the algorithm is slow in PostgreSQL or whatever you use. More importantly, the database is not the right place for this logic: as you say yourself, it would be hard to test, and it's bad practice overall. It would also affect the performance of your requests every time the DB has to calculate the algorithm. And the DB would still use a lot of memory processing this, so that isn't an advantage either.
2. By far the best solution. See below for more explanation.
3. This is a much better solution than number one. However, it would mean that your app's performance would be very unstable: sometimes all resources would be free for normal requests, and sometimes all of them would go to your calculations.
Option 2 is the best solution, as it doesn't interfere with the performance of the rest of your app and is much easier to scale, since it works in isolation. If, for example, your worker can't keep up, you can just add some more running processes.
More importantly, you would be able to run the background processes on a separate server and thereby easily monitor memory and resource usage, and scale that server as necessary.
Even for real-time updates, a background job is the best solution (unless, of course, the calculation is small enough to be done in the request). You could create a "high priority" queue that has enough resources to almost always be empty. If you need to show the result to the user without a page reload, you would have to add some kind of push notification after a background job completes. This notification could then trigger an update on the page through JavaScript (you can also check out the new live streaming feature of Rails 4).
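As a sketch of the push part, assuming Rails 4's ActionController::Live and a cache key written by the background worker (the key and the polling loop are illustrative, not something your app must match):

```ruby
# Hypothetical streaming endpoint: waits for the worker's cached result
# and pushes it to the browser as a server-sent event.
class RecommendationsController < ApplicationController
  include ActionController::Live

  def stream
    response.headers["Content-Type"] = "text/event-stream"
    30.times do
      if (recs = Rails.cache.read("recommendations:#{params[:user_id]}"))
        response.stream.write("data: #{recs.to_json}\n\n")
        break
      end
      sleep 1   # crude polling; a pub/sub channel would be nicer
    end
  ensure
    response.stream.close
  end
end
```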
I would recommend something like Sidekiq with Redis. You could then cache the results in Memcached, or recalculate the result each time; that really depends on how often you need the calculation. With this solution, though, it is much easier to set up a stable cache if you want one.
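A minimal worker could look something like this (RecommendationEngine is a stand-in for whatever implements your algorithm; the cache key and expiry are assumptions):

```ruby
# Hypothetical Sidekiq worker that computes and caches recommendations.
class RecommendationWorker
  include Sidekiq::Worker
  sidekiq_options queue: :critical   # the "high priority" queue from above

  def perform(user_id)
    result = RecommendationEngine.recommendations_for(user_id)
    # Cache the result so requests never pay for the computation themselves.
    Rails.cache.write("recommendations:#{user_id}", result, expires_in: 12.hours)
  end
end

# Enqueue from anywhere in the app, e.g. after a user rates something:
RecommendationWorker.perform_async(user.id)
```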
Where I work, we have an application that runs some heavy queries with a lot of calculations like this. Each night these jobs are queued and then run on an isolated server over the next few hours. This scales really well and is also easy to monitor with New Relic.
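If you go the nightly route, something like the `whenever` gem makes the schedule explicit (a sketch; any cron-style scheduler works just as well):

```ruby
# config/schedule.rb -- enqueue the heavy jobs every night at a quiet hour.
every 1.day, at: "2:00 am" do
  runner "User.find_each { |u| RecommendationWorker.perform_async(u.id) }"
end
```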
Hope this helps and makes sense (I know my English isn't perfect), but please feel free to ask if I misunderstood something or you have more questions.