Search code examples
mapreducetagscouchdbrdbmsintersection

Is it better to find most highly related posts by tag with CouchDB or RDBMS?


I want to find related posts by finding posts with the highest number of tags in common with the post in question. I can do this in a SQL query, but since I'll have thousands of posts with dozens of tags each I was wondering if CouchDB would be a better option? If so, how would it work in CouchDB?

Thinking about this problem relationally, here is the structure:

post: id, (other post data)
tag: id, (other tag data)
post_tag: post_id, tag_id

SQL to find the top 10 most related posts for post #X:

select r.post_id
from post_tag p join post_tag r on p.tag_id = r.tag_id
where p.post_id = X and r.post_id != X
group by r.post_id
order by count(*) desc
limit 10

This query gives the top 10 posts with the most tags in common with post X.

Can I get the same results from CouchDB more efficiently? How?


Solution

  • This query gives the top 10 posts with the most tags in common with post X.

    This kind of queries is horrible to compute with Map/Reduce, since, for every document, you would have to generate a key for every combination of tags. Moreover, you would need several chained map/reduces to achieve that efficiently. So CouchDB is definitely not the easy way to do that.

    As Marcin wrote, you could "cheat" and use a search server over CouchDB, but your SQL query seems to be more elegant.