Tags: sql, sql-server, relational-database, database-performance, redundancy

Redundancy vs. aggregated data for performance


I have several code/value/synonym lists (ICD codes, among others) that span several validity periods, aggregated into versions (one per year).

So now I could choose the fully normalized approach, with the following structure:

VERSIONS(id INT PRIMARY KEY, name VARCHAR)
CODES(id INT PRIMARY KEY, code VARCHAR)
VALUES(id INT PRIMARY KEY, text VARCHAR)

CODEVALUES(code_id INT FOREIGN KEY REFERENCES CODES(id),
    value_id INT FOREIGN KEY REFERENCES VALUES(id),
    version_id INT FOREIGN KEY REFERENCES VERSIONS(id),
    synonym_nr INT)
    with PK(code_id, value_id, version_id)
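
For reference, a directly runnable T-SQL version of that sketch could look as follows; the VARCHAR lengths are my own assumptions, and note that VALUES is a reserved word in SQL Server, so that table name has to be quoted:

CREATE TABLE VERSIONS (id INT PRIMARY KEY, name VARCHAR(100));
CREATE TABLE CODES    (id INT PRIMARY KEY, code VARCHAR(20));
CREATE TABLE [VALUES] (id INT PRIMARY KEY, text VARCHAR(255)); -- VALUES is reserved, hence the brackets

CREATE TABLE CODEVALUES (
    code_id    INT NOT NULL REFERENCES CODES(id),
    value_id   INT NOT NULL REFERENCES [VALUES](id),
    version_id INT NOT NULL REFERENCES VERSIONS(id),
    synonym_nr INT NOT NULL,
    PRIMARY KEY (code_id, value_id, version_id)
);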

This way I may have up to 14 records for a code/value pair that hasn't changed in the last 14 years. For >14,000 codes with up to 20 synonyms each, that adds up to >2,000,000 records in CODEVALUES (the theoretical worst case being 14,000 × 20 × 14 ≈ 3.9 million).

An alternative would be to use an aggregated table, like

CODES(code VARCHAR, value VARCHAR, synonym_nr INT, min_version INT, max_version INT)

No FKs. For every combination of code/value/synonym_nr there's only one record.
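
In runnable form, with an assumed unique constraint enforcing that one-record rule and a covering index to support the lookups described below (the table name, constraint names, and lengths are my own):

CREATE TABLE CODES_FLAT (  -- renamed so it can coexist with the normalized CODES above
    code        VARCHAR(20)  NOT NULL,
    value       VARCHAR(255) NOT NULL,
    synonym_nr  INT          NOT NULL,
    min_version INT          NOT NULL,
    max_version INT          NOT NULL,
    CONSTRAINT UQ_code_value_synonym UNIQUE (code, value, synonym_nr)
);

CREATE INDEX IX_code_lookup
    ON CODES_FLAT (code, synonym_nr, min_version)
    INCLUDE (max_version, value);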

I know about normalization, but I'm trying to reduce development and administration complexity: I need one O/RM entity for every SQL table, including its relations, and since I have dozens of such code lists, a factor of 4 in the number of classes is significant.

I wonder if there are performance differences between those alternatives.

UPDATE:

The queries on these lists look up a certain code under a specific version and fetch the default value for that code (synonym_nr = 0). As these lookups are often part of larger queries, there may be several tens of thousands up to 100k of them per query transaction. With approach #1 I have at least 2 joins, and the DB has to hold a mapping record for every version (redundancy for code/value). Approach #2 instead defines a valid version range, which I have to query via

WHERE version >= min_version AND version <= max_version

So it's joins and more records (index efficiency?) versus range comparisons in the query constraints. Would there be a significant performance difference?
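
To make the two shapes concrete, a single lookup would read roughly as follows under each approach (using the example tables above; the code literal, the version value, and the assumption that version ids equal years are mine):

DECLARE @code VARCHAR(20) = 'A00.1', @version INT = 2014;

-- Approach #1: two joins, one mapping row per version
SELECT v.text
FROM CODEVALUES cv
JOIN CODES c    ON c.id = cv.code_id
JOIN [VALUES] v ON v.id = cv.value_id
WHERE c.code = @code
  AND cv.version_id = @version  -- assumes version ids equal the year
  AND cv.synonym_nr = 0;

-- Approach #2: single table, range predicate
SELECT value
FROM CODES_FLAT
WHERE code = @code
  AND synonym_nr = 0
  AND @version >= min_version
  AND @version <= max_version;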


Solution

  • I'm completely with @SeanLange on this one:

    It will save very little time up front and cost LOTS more in the long run.

    Model properly now and you won't have to troubleshoot everyone else's queries later on.

    Consider using smaller datatypes for your Version, Code and Value PKs, e.g. TINYINT or SMALLINT instead of INT if appropriate. Consider a view for your aggregated table and point your ORM at the view if you want to.
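
    A sketch of such a view, collapsing the normalized tables into the flat range shape (names are taken from the question's structure; note this assumes a code/value pair stays valid over one contiguous run of versions):

    CREATE VIEW CODEVALUES_FLAT AS
    SELECT c.code,
           v.text AS value,
           cv.synonym_nr,
           MIN(cv.version_id) AS min_version,
           MAX(cv.version_id) AS max_version
    FROM CODEVALUES cv
    JOIN CODES c    ON c.id = cv.code_id
    JOIN [VALUES] v ON v.id = cv.value_id
    GROUP BY c.code, v.text, cv.synonym_nr;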

    Alternatively, consider a different modelling approach. If the rate of change is low, then using a 'from' and 'to' approach for the version numbers might be more compact.

    Based on the way you've written your question, I'm guessing you're at least reasonably competent with SQL Server. Try both approaches and look at the query plans for 'typical' queries to see how SQL Server handles the different approaches.
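
    One way to run that comparison, as a sketch (the two queries being the join and range lookups shown in the question):

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Run the approach #1 (join) lookup and the approach #2 (range) lookup here,
    -- then compare the logical reads and elapsed times in the Messages tab.
    -- In SSMS, 'Include Actual Execution Plan' (Ctrl+M) also lets you compare
    -- the chosen operators (index seek vs. scan) side by side.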