Search code examples
c#asp.netsql-serversql-server-2005tag-cloud

Suggestion for a tag cloud algorithm


I have a MSSQL 2005 table:

[Companies](
    [CompanyID] [int] IDENTITY(1,1) NOT NULL,
    [Title] [nvarchar](128),
    [Description] [nvarchar](256),
    [Keywords] [nvarchar](256)
)

I want to generate a tag cloud for this companies. But I've saved all keywords in one column separated by commas. Any suggestions for how to generate tag cloud by most used keywords. There could be millions of companies approx ten keywords per company.

Thank you.


Solution

  • Step 1: separate the keywords into a proper relation (table).

    CREATE TABLE Keywords (KeywordID int IDENTITY(1,1) NOT NULL
      , Keyword NVARCHAR(256)
      , constraint KeywordsPK primary key (KeywordID)
      , constraint KeywordsUnique unique (Keyword));
    

    Step 2: Map the many-to-many relation between companies and tags into a separate table, like all many-to-many relations:

    CREATE TABLE CompanyKeywords (
       CompanyID int not null
       , KeywordID int not null
       , constraint CompanyKeywords primary key (KeywordID, CompanyID)
       , constraint CompanyKeyword_FK_Companies
          foreign key (CompanyID)
          references Companies(CompanyID)
       , constraint CompanyKeyword_FK_Keywords
          foreign key (KeywordID)
          references Keywords (KeywordID));
    

    Step 3: Use a simple GROUP BY query to generate the 'cloud' (by example taking the 'cloud' to mean the most common 100 tags):

    with cte as (
    SELECT TOP 100 KeywordID, count(*) as Count
    FROM CompanyKeywords
    group by KeywordID
    order by count(*) desc)
    select k.Keyword, c.Count
    from cte c
    join Keyword k on c.KeywordID = k.KeywordID;
    

    Step 4: cache the result as it changes seldom and it computes expensively.