I'm currently developing an application in which I want to group similar items. Items (such as videos) can be created by users, and their attributes can be altered or extended later (for example, with new tags). Instead of relying on users' preferences, as most collaborative filtering mechanisms do, I want to compare items based on their attributes (similar length, similar colors, a similar set of tags, etc.). The computation is needed for two main purposes: suggesting the x most similar items for a given item, and clustering items into groups of similar items.
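To make the attribute-based comparison concrete, here is a minimal Java sketch of the kind of per-attribute similarity I have in mind, combined into a single weighted score. The class name, the weights, and the choice of Jaccard for tags are only illustrative assumptions, not something I have settled on:

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch: combine tag overlap (Jaccard) with a normalized
 *  length difference into a single similarity score in [0, 1]. */
public final class AttributeSimilarity {

    /** Jaccard similarity of two tag sets: |intersection| / |union|. */
    static double tagSimilarity(Set<String> a, Set<String> b) {
        if (a.isEmpty() && b.isEmpty()) return 1.0;
        Set<String> intersection = new HashSet<>(a);
        intersection.retainAll(b);
        Set<String> union = new HashSet<>(a);
        union.addAll(b);
        return (double) intersection.size() / union.size();
    }

    /** Length similarity: 1 when equal, approaching 0 as lengths diverge. */
    static double lengthSimilarity(double secondsA, double secondsB) {
        double max = Math.max(secondsA, secondsB);
        return max == 0 ? 1.0 : 1.0 - Math.abs(secondsA - secondsB) / max;
    }

    /** Weighted combination; the weights are arbitrary placeholders. */
    static double similarity(Set<String> tagsA, double lenA,
                             Set<String> tagsB, double lenB) {
        return 0.7 * tagSimilarity(tagsA, tagsB)
             + 0.3 * lengthSimilarity(lenA, lenB);
    }
}
```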
My application so far follows an asynchronous design, and I want to decouple this clustering component as far as possible. The creation of new items and the addition of new attributes to existing items will be advertised by publishing events, which the component can then consume.
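The way I picture the decoupling is an event consumer that keeps only its own snapshot of item attributes and marks items as dirty for the next best-effort pass. A rough sketch; the method names, the in-memory map, and the broker wiring are assumptions for illustration:

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical consumer of the published events described above. It only
 *  maintains a local snapshot of item attributes; the actual message broker
 *  and event classes are assumed, not specified. */
public final class ClusteringComponent {

    private final Map<Long, Set<String>> tagsByItem = new ConcurrentHashMap<>();

    /** Called when an "item created" event is consumed. */
    public void onItemCreated(long itemId, Set<String> initialTags) {
        Set<String> tags = ConcurrentHashMap.newKeySet();
        tags.addAll(initialTags);
        tagsByItem.put(itemId, tags);
        // enqueue itemId for (re)computation of similarities / cluster assignment
    }

    /** Called when a "tag added" event is consumed. */
    public void onTagAdded(long itemId, String tag) {
        tagsByItem.computeIfAbsent(itemId, id -> ConcurrentHashMap.newKeySet()).add(tag);
        // mark itemId dirty so the next best-effort pass refreshes its neighborhood
    }
}
```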
Computations can be provided best-effort and "snapshotted": I'm okay with the best result available at a given point in time, even though the result quality will only improve over time.
So I am now looking for suitable algorithms to compute both similar items and clusters. An important constraint is scalability: initially the application has to handle a few thousand items, but later millions of items may be possible as well. Of course, computations would then run on additional nodes, but the algorithm itself should scale. It would also be nice if the algorithm supported some kind of incremental mode for partial changes to the data.
My initial idea of comparing every item with every other item and storing the numerical similarities sounds a little crude. It also requires n*(n-1)/2 entries to store all similarities, and any change or new item eventually triggers n similarity computations.
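To make the growth concrete, here is a sketch of that naive all-pairs approach (the similarity stub and class name are placeholders): with 1,000 items it already means 499,500 pairs, and with 1,000,000 items roughly 5*10^11.

```java
/** Naive all-pairs computation: n*(n-1)/2 similarity calls and stored entries.
 *  For n = 1_000 that is 499_500 pairs; for n = 1_000_000 it is ~5e11,
 *  which is why this approach does not scale. */
public final class AllPairs {

    /** Placeholder: would look up both items' attributes and compare them. */
    static double similarity(long itemA, long itemB) {
        return 0.0; // stub for illustration
    }

    static double[][] allSimilarities(long[] itemIds) {
        int n = itemIds.length;
        double[][] sim = new double[n][n];            // O(n^2) storage
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {         // n*(n-1)/2 iterations
                sim[i][j] = similarity(itemIds[i], itemIds[j]);
                sim[j][i] = sim[i][j];
            }
        }
        return sim;
    }
}
```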
Thanks in advance!
UPDATE tl;dr

To clarify what I want, here is my targeted scenario:

- Users create items (e.g. videos) with attributes such as length, colors, and tags.
- Attributes can be altered or extended later (e.g. new tags are added).

And here is what my system should provide:

- The x most similar items for a given item.
- Clusters (groups) of similar items.

Both calculations should be based on:

- The items' attributes only, not on users' preferences (no classic collaborative filtering).

Ideally, the algorithm should support:

- Scalability from a few thousand to millions of items.
- An incremental mode for partial changes to the data.
- Best-effort, "snapshotted" results.
Instead of writing something from scratch, take a look at mahout.apache.org. It has the clustering algorithms you are looking for, as well as recommendation algorithms. It works alongside Hadoop, so you can scale it out easily.
This will allow you to determine similar documents within a cluster based on the keywords and/or description of a video.
https://cwiki.apache.org/MAHOUT/k-means-clustering.html has a quick tutorial on clustering documents using a Reuters dataset, which is quite similar to what you are trying to achieve. Mahout also includes recommendation algorithms such as slope one, user-based, and item-based, and it is incredibly easy to extend. It has some pretty useful clustering algorithms that support dimensionality reduction as well, which is useful in case your matrix is sparse (that is, a lot of tags that have very few usage stats).
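If you go the Mahout route, the Taste part of the library exposes item-to-item similarity fairly directly. A minimal sketch, assuming a Mahout 0.x-era Taste API and a CSV file of (id, itemID) associations, e.g. one row per tag-to-video assignment; the file name and that data layout are assumptions on my part:

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.TanimotoCoefficientSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

public class MostSimilarVideos {
    public static void main(String[] args) throws Exception {
        // Assumed CSV rows: tagId,videoId (boolean associations, hence Tanimoto)
        DataModel model = new FileDataModel(new File("tag-video.csv"));
        ItemSimilarity similarity = new TanimotoCoefficientSimilarity(model);
        GenericItemBasedRecommender recommender =
                new GenericItemBasedRecommender(model, similarity);

        // x most similar items for a given video id
        List<RecommendedItem> similar = recommender.mostSimilarItems(42L, 10);
        for (RecommendedItem item : similar) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}
```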
Also take a look at Lucene and its tf-idf features for clustering tags and documents, and check out Solr as well. Both are Apache projects.
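To illustrate the tf-idf weighting idea without going through the Lucene API itself, here is a plain-Java sketch for tags; the class and method names are made up for illustration:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Plain-Java sketch of the tf-idf idea (not the Lucene API): rare tags get a
 *  higher weight than tags that appear on almost every item. For tags the term
 *  frequency is usually 1, so the idf part does most of the work. */
public final class TagIdf {

    /** idf(tag) = ln(numberOfItems / numberOfItemsCarryingTheTag). */
    static Map<String, Double> inverseDocumentFrequency(List<List<String>> tagsPerItem) {
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> tags : tagsPerItem) {
            for (String tag : new HashSet<>(tags)) {   // count each tag once per item
                documentFrequency.merge(tag, 1, Integer::sum);
            }
        }
        Map<String, Double> idf = new HashMap<>();
        int numberOfItems = tagsPerItem.size();
        documentFrequency.forEach(
                (tag, count) -> idf.put(tag, Math.log((double) numberOfItems / count)));
        return idf;
    }
}
```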