c# .net sql algorithm tag-cloud

Looking for a faster algorithm to count tags/keywords/labels on a document database for a dynamic tagcloud


Current state

  • .NET 4.0 Application (WPF)
  • Database: SQLCE
  • Tables (simplified): Documents, Tags, DocumentsTags [n:n]
  • roughly 2000 documents and 600 tags (tags can be assigned to multiple documents)
  • tags = keywords = labels

Case

The user has a large document database that they can filter with a tag cloud. Each tag displays a name (the tag name itself) and a number, which is the total count of documents carrying that tag. If the user selects a tag, only the documents with the selected tag are shown. The dynamic tag cloud should then show only the tags present on the filtered documents, each with an updated count.

Problem

It is slow. After each tag selection, we have to evaluate all the documents again to count the tags. We currently do it recursively, checking each document for the tags it has. We are looking for another solution (caching, a better algorithm, your idea?).

Similarities

Stack Overflow and del.icio.us also have tag clouds; check them out yourself. How do they do it? I know stored procedures would be a solution, but according to our database developer they are not available on SQLCE.


Solution

  • You can use two inverted indexes, where each tag will be a key in both.

    One inverted index is a map: tag -> list of tags (all the tags that co-occur with the key).
    The second is a map: tag -> list of documents (all the documents that carry the tag).

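    Calculating the relevant set of documents after some tags have been selected is simply an intersection of the selected tags' document lists, which can be done efficiently.
    Finding the updated tag cloud is again an intersection on the inverted index: intersect the co-occurrence lists of the selected tags to get the candidate tags, then intersect each candidate's document list with the filtered documents to get its count.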

    Note that the inverted indexes can be built offline, and building them is a classic example of map-reduce usage.

    This thread discusses how to efficiently find intersections in an inverted index.
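
To make the two indexes concrete, here is a minimal in-memory sketch. It is written in Python for brevity; in the asker's C# setting the same shape would roughly be a `Dictionary<string, HashSet<int>>` per index. The function names (`build_indexes`, `filter_docs`, `tag_cloud`) and the input shape (a mapping from document id to its tags) are assumptions made for illustration and are not part of the original answer. The sketch also assumes that already-selected tags are left out of the updated cloud.

```python
from collections import defaultdict


def build_indexes(doc_tags):
    """Build both inverted indexes from {doc_id: iterable of tag names}.

    The input shape is a hypothetical stand-in for the DocumentsTags table.
    """
    tag_to_docs = defaultdict(set)   # tag -> ids of documents carrying it
    tag_to_tags = defaultdict(set)   # tag -> tags that co-occur with it
    for doc_id, tags in doc_tags.items():
        tags = set(tags)
        for tag in tags:
            tag_to_docs[tag].add(doc_id)
            tag_to_tags[tag] |= tags - {tag}
    return tag_to_docs, tag_to_tags


def filter_docs(tag_to_docs, selected):
    """Return the ids of documents that carry every selected tag."""
    if not selected:
        # Nothing selected: every indexed document is visible.
        return set().union(*tag_to_docs.values())
    # Intersect the smallest posting sets first to keep the result small.
    postings = sorted((tag_to_docs.get(t, set()) for t in selected), key=len)
    result = set(postings[0])
    for posting in postings[1:]:
        result &= posting
        if not result:
            break
    return result


def tag_cloud(tag_to_docs, tag_to_tags, selected):
    """Return {tag: count} for the tags still available after filtering."""
    docs = filter_docs(tag_to_docs, selected)
    if selected:
        # Candidates co-occur with every selected tag somewhere. This is a
        # superset of the real answer, so the counts below filter it.
        candidates = set.intersection(
            *(tag_to_tags.get(t, set()) for t in selected)
        )
    else:
        candidates = set(tag_to_docs)
    cloud = {}
    for tag in candidates:
        count = len(tag_to_docs[tag] & docs)
        if count:
            cloud[tag] = count
    return cloud
```

With roughly 2000 documents and 600 tags, both indexes fit comfortably in memory, so they could be built once at startup and updated whenever a tag is assigned or removed. The co-occurrence index only narrows the candidate list: two tags can each co-occur with the selection in different documents, which is why the final count is taken against the filtered document set.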