algorithm tags graph-theory bookmarks data-analysis

how to extract group of nodes which has a certain amount of common child nodes

I'm solving a quiz and need some advices.

a summary of the quiz follows:

Analayze data of bookmarking service (like delicious, digg...) and extract the group of urls which has more than two common tags.

each bookmark data contains 1)user-id, 2)url, and 3)an array of tags.

size of all tags are relatively small compared to all urls. that is, people bookmark sites with limited set

All tags assigned to a URL are different

If different users bookmarked the same URL, you should not make group out of them.(However, this is a optional condition. You can just ignore user_id and suppose that all URLs are different.)

Example:

siteA - [tag1, tag2, tag3]
siteB - [tag1, tag2, tag4]
siteC - [tag1, tag3, tag5]
siteD - [tag1, tag2, tag6]

following two groups of URL will be the result

(siteA, siteB, siteD), (siteA, siteC)

because (siteA, siteB, siteD) share two common tags(tag1, tag2) and (siteA, siteC) also share two common tags (tag1, tag3) .

-- conditon 3,4 and an example added. thank you @btilly.

My question is

how it can be solved (or which algorithm can be applied) and practically fast?
is there any kind of representative problems that can be solved with similar algorithms to this question?

Solution

I would create a new data structure, which is by tag, a hash of the URLs that have that tag.

Then for each pair of tags you can take the one with less URLs, walk through them, and do a lookup to see if it is in the other, generating the group that share that pair of tags.

If you have n tags with an average of m urls per tag, it will take O(n * m) to generate the new data structure, and O(n * n * m) to generate the groups.