I'm solving a quiz and need some advices.
a summary of the quiz follows:
Analayze data of bookmarking service (like delicious, digg...) and extract the group of urls which has more than two common tags.
- each bookmark data contains 1)user-id, 2)url, and 3)an array of tags.
- size of all tags are relatively small compared to all urls. that is, people bookmark sites with limited set
- All tags assigned to a URL are different
- If different users bookmarked the same URL, you should not make group out of them.(However, this is a optional condition. You can just ignore user_id and suppose that all URLs are different.)
Example:
siteA - [tag1, tag2, tag3]
siteB - [tag1, tag2, tag4]
siteC - [tag1, tag3, tag5]
siteD - [tag1, tag2, tag6]
following two groups of URL will be the result
(siteA, siteB, siteD), (siteA, siteC)
because (siteA, siteB, siteD) share two common tags(tag1, tag2) and (siteA, siteC) also share two common tags (tag1, tag3) .
-- conditon 3,4 and an example added. thank you @btilly.
My question is
I would create a new data structure, which is by tag, a hash of the URLs that have that tag.
Then for each pair of tags you can take the one with less URLs, walk through them, and do a lookup to see if it is in the other, generating the group that share that pair of tags.
If you have n
tags with an average of m
urls per tag, it will take O(n * m)
to generate the new data structure, and O(n * n * m)
to generate the groups.