I'm curious about the best way to model this for optimized performance. I'm not as concerned about real-time data integrity.
I'll continue with the stackoverflow example
Question
  id
  title
Votes
  id
  user
  question
A question has many votes
For many queries however, we're only concerned with the aggregate number of votes (e.g. to show next to the question).
Good relational db theory would create the two entities (Q and V) as separate relations, requiring a join then a sum or count aggregate call.
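For reference, the normalized read could look roughly like the sketch below. It assumes the tables are literally named Question and Votes with the columns listed above; the LEFT JOIN is there so that questions without any votes still show up with a count of 0.

```sql
-- Vote count per question, computed at read time from the normalized tables.
-- COUNT(v.id) ignores the NULL produced by the LEFT JOIN for unvoted questions.
SELECT q.id, q.title, COUNT(v.id) AS votes
FROM Question q
LEFT JOIN Votes v ON v.question = q.id
GROUP BY q.id, q.title;
```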
Another possibility is to break normal form and occasionally materialize the aggregate value of votes as an attribute in Question (e.g. Question.votes). Performance is gained on reads; however, depending on how stale you are willing to let your "votes" data get, it requires a lot more writes to that Question record, which in turn hinders performance.
Other techniques involving caching, etc. can be used. But I'm just wondering, performance-wise, what's the best solution? Let's say the site is high traffic and receives considerably more votes than questions.
Open to non-relational models as well.
It's unlikely that a join will be too slow in this case, especially if you have an index on (question) in the Votes table.
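A minimal sketch of such an index, assuming SQL Server syntax; the index name is made up for illustration:

```sql
-- Hypothetical index name. With this in place the join and count can seek on
-- Votes.question instead of scanning the whole Votes table.
CREATE INDEX IX_Votes_question ON Votes (question);
```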
If it is REALLY too slow, you can cache the vote count in the Question table:
id - title - votecount
You can update the votecount whenever you record a vote, for example from a stored procedure or directly from your application code.
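As a rough sketch of the stored procedure route: the procedure name, parameter names, and INT types below are assumptions, and it also assumes votes.id is generated automatically (e.g. an identity column). Note that user is a reserved word in SQL Server, so the column has to be bracketed.

```sql
-- Hypothetical procedure: records one vote and bumps the cached count
-- in the same transaction, so the two stay in step under normal operation.
CREATE PROCEDURE RecordVote
    @UserId INT,
    @QuestionId INT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRANSACTION;

    INSERT INTO votes ([user], question)
    VALUES (@UserId, @QuestionId);

    -- Increment in place rather than reading the value into the application first.
    UPDATE questions
    SET votecount = votecount + 1
    WHERE id = @QuestionId;

    COMMIT TRANSACTION;
END;
```

It would be called as EXEC RecordVote @UserId = 1, @QuestionId = 42; the same two statements can just as well be issued from application code.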
Those updates are tricky, but since you're not that worried about consistency, I guess it's OK if the vote count is sometimes not exactly right. To fix any errors, you can periodically regenerate all cached counts like:
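-- SQL Server rejects an aggregate directly in the SET list, so count in a derived table.
UPDATE q
SET votecount = c.cnt
FROM questions q
JOIN (SELECT q2.id, COUNT(v.question) AS cnt
      FROM questions q2
      LEFT JOIN votes v ON v.question = q2.id
      GROUP BY q2.id) c ON c.id = q.id;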
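The aggregate count(v.question) returns 0 for a question that has no votes: the left join still produces one row for that question, with v.question set to NULL, and count(column) ignores NULLs. count(*) would count that row and return 1.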
If locks are an issue, consider using "with (nolock)" or "set transaction isolation level read uncommitted" to bypass locks (again, based on data integrity being a low priority). Both allow dirty reads, so a query may see votes from transactions that are later rolled back.
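The two forms look roughly like this, reusing the questions and votes tables from the query above. The hint applies only to the tables it is attached to, while the SET applies to the rest of the session.

```sql
-- Table hint: read each table without taking shared locks.
SELECT q.id, q.title, COUNT(v.question) AS votes
FROM questions q WITH (NOLOCK)
LEFT JOIN votes v WITH (NOLOCK) ON v.question = q.id
GROUP BY q.id, q.title;

-- Session setting: every following statement on this connection
-- reads uncommitted data until the isolation level is changed again.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT id, title, votecount FROM questions;
```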
As an alternative to nolock, consider "read committed snapshot", which is meant for databases with heavy read and less write activity. You can turn it on with:
ALTER DATABASE YourDb SET READ_COMMITTED_SNAPSHOT ON;
It is available in SQL Server 2005 and higher. This is how Oracle works by default, and it's what Stack Overflow itself uses. There's even a Coding Horror blog entry about it.
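To check whether the option is on, something like the following should work; YourDb is the same placeholder name as in the ALTER DATABASE statement.

```sql
-- is_read_committed_snapshot_on is 1 when the option is enabled for the database.
SELECT name, is_read_committed_snapshot_on
FROM sys.databases
WHERE name = 'YourDb';
```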