I'm currently developing an application in which I want to group similar items. Items (such as videos) can be created by users, and their attributes can be altered or extended later (for example, with new tags). Instead of relying on users' preferences, as most collaborative filtering mechanisms do, I want to compare items based on their attributes (similar length, similar colors, a similar set of tags, etc.). The computation is needed for two main purposes: suggesting the x most similar items for a given item, and clustering items into groups of similar items.
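To make the attribute-based comparison concrete, here is a minimal Java sketch of the kind of per-attribute similarity I have in mind, combined into a single weighted score. The class name, the weights, and the choice of Jaccard for tags are only illustrative assumptions, not something I have settled on:

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch: combine tag overlap (Jaccard) with a normalized
 *  length difference into a single similarity score in [0, 1]. */
public final class AttributeSimilarity {

    /** Jaccard similarity of two tag sets: |intersection| / |union|. */
    static double tagSimilarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Length similarity: 1 when equal, approaching 0 as lengths diverge. */
    static double lengthSimilarity(double secondsA, double secondsB) {
        double max = Math.max(secondsA, secondsB);
        return max == 0 ? 1.0 : 1.0 - Math.abs(secondsA - secondsB) / max;
    }

    /** Weighted combination; the weights are arbitrary placeholders. */
    static double similarity(Set<String> tagsA, double lenA,
                             Set<String> tagsB, double lenB) {
        return 0.7 * tagSimilarity(tagsA, tagsB)
             + 0.3 * lengthSimilarity(lenA, lenB);
    }
}
```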
My application so far follows an asynchronous design, and I want to decouple this clustering component as far as possible. The creation of new items and the addition of new attributes to existing items will be advertised by publishing events, which the component can then consume.
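The way I picture the decoupling is an event consumer that keeps only its own snapshot of item attributes and marks items as dirty for the next best-effort pass. A rough sketch; the method names, the in-memory map, and the broker wiring are assumptions for illustration:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical consumer of the published events described above. It only
 *  maintains a local snapshot of item attributes; the actual message broker
 *  and event classes are assumed, not specified. */
public final class ClusteringComponent {

    private final Map<Long, Set<String>> tagsByItem = new ConcurrentHashMap<>();

    /** Called when an "item created" event is consumed. */
    public void onItemCreated(long itemId, Set<String> initialTags) {
        Set<String> tags = ConcurrentHashMap.newKeySet();
        tags.addAll(initialTags);
        tagsByItem.put(itemId, tags);
        // enqueue itemId for (re)computation of similarities / cluster assignment
    }

    /** Called when a "tag added" event is consumed. */
    public void onTagAdded(long itemId, String tag) {
        tagsByItem.computeIfAbsent(itemId, id -> ConcurrentHashMap.newKeySet()).add(tag);
        // mark itemId dirty so the next best-effort pass refreshes its neighborhood
    }
}
```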
Computations can be provided best-effort and "snapshotted": I'm okay with the best result available at a given point in time, even though the result quality will only improve over time.
So I am now looking for suitable algorithms to compute both similar items and clusters. An important constraint is scalability: initially the application has to handle a few thousand items, but later millions of items may be possible as well. Of course, computations would then run on additional nodes, but the algorithm itself should scale. It would also be nice if the algorithm supported some kind of incremental mode for partial changes to the data.
My initial idea of comparing every item with every other item and storing the numerical similarities sounds a little crude. It also requires n*(n-1)/2 entries to store all similarities, and any change or new item eventually triggers n similarity computations.
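To make the growth concrete, here is a sketch of that naive all-pairs approach (the similarity stub and class name are placeholders): with 1,000 items it already means 499,500 pairs, and with 1,000,000 items roughly 5*10^11.

```java
/** Naive all-pairs computation: n*(n-1)/2 similarity calls and stored entries.
 *  For n = 1_000 that is 499_500 pairs; for n = 1_000_000 it is ~5e11,
 *  which is why this approach does not scale. */
public final class AllPairs {

    /** Placeholder: would look up both items' attributes and compare them. */
    static double similarity(long itemA, long itemB) {
        return 0.0; // stub for illustration
    }

    static double[][] allSimilarities(long[] itemIds) {
        int n = itemIds.length;
        double[][] sim = new double[n][n];            // O(n^2) storage
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {         // n*(n-1)/2 iterations
                sim[i][j] = similarity(itemIds[i], itemIds[j]);
                sim[j][i] = sim[i][j];
            }
        }
        return sim;
    }
}
```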
Thanks in advance!
UPDATE tl;dr

To clarify what I want, here is my targeted scenario:

- Users create items (e.g. videos) with attributes such as length, colors, and tags.
- Attributes can be altered or extended later (e.g. new tags are added).

And here is what my system should provide:

- The x most similar items for a given item.
- Clusters (groups) of similar items.

Both calculations should be based on:

- The items' attributes only, not on users' preferences (no classic collaborative filtering).

Ideally, the algorithm should support:

- Scalability from a few thousand to millions of items.
- An incremental mode for partial changes to the data.
- Best-effort, "snapshotted" results.
Instead of writing something from scratch, take a look at mahout.apache.org. It has the clustering algorithms you are looking for, as well as recommendation algorithms. It works alongside Hadoop, so you can scale it out easily.
This will allow you to determine similar documents within a cluster based on the keywords and/or description of a video.
https://cwiki.apache.org/MAHOUT/k-means-clustering.html has a quick tutorial on clustering documents using a Reuters dataset, which is quite similar to what you are trying to achieve. Mahout also includes recommendation algorithms such as slope one, user-based, and item-based, and it is incredibly easy to extend. It has some pretty useful clustering algorithms that support dimensionality reduction as well, which is useful in case your matrix is sparse (that is, a lot of tags that have very few usage stats).
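If you go the Mahout route, the Taste part of the library exposes item-to-item similarity fairly directly. A minimal sketch, assuming a Mahout 0.x-era Taste API and a CSV file of (id, itemID) associations, e.g. one row per tag-to-video assignment; the file name and that data layout are assumptions on my part:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MostSimilarVideos {
    public static void main(String[] args) throws Exception {
        // Assumed CSV rows: tagId,videoId (boolean associations, hence Tanimoto)
        DataModel model = new FileDataModel(new File("tag-video.csv"));
        ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);

        // x most similar items for a given video id
        List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);
        for (RecommendedItem item : similar) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```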
Also take a look at Lucene and its tf-idf features for clustering tags and documents, and check out Solr as well. Both are Apache projects.
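To illustrate the tf-idf weighting idea without going through the Lucene API itself, here is a plain-Java sketch for tags; the class and method names are made up for illustration:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of the tf-idf idea (not the Lucene API): rare tags get a
 *  higher weight than tags that appear on almost every item. For tags the term
 *  frequency is usually 1, so the idf part does most of the work. */
public final class TagIdf {

    /** idf(tag) = ln(numberOfItems / numberOfItemsCarryingTheTag). */
    static Map<String, Double> inverseDocumentFrequency(List<List<String>> tagsPerItem) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> tags : tagsPerItem) {
            for (String tag : new HashSet<>(tags)) {   // count each tag once per item
                documentFrequency.merge(tag, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        int numberOfItems = tagsPerItem.size();
        documentFrequency.forEach(
                (tag, count) -> idf.put(tag, Math.log((double) numberOfItems / count)));
        return idf;
    }
}
```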