Search code examples
data-warehousedatabase-normalizationstar-schema

Does a data warehouse need to satisfy 2NF or another normal form?


I'm investigating data warehouses. And I have an issue about star schemas.

It's in
Oracle® OLAP Application Developer's Guide
10g Release 1 (10.1)
3.2.1 Dimension Table: TIME_DIM
https://docs.oracle.com/cd/B13789_01/olap.101/b10333/global.htm#CHDCGABE

To represent the hierarchy MONTH -> QUARTER -> YEAR, we need some keys such as: YEAR_ID, QUARTER_ID. But there are some things that I do not understand:

1) Why do we need field YEAR_DSC & QUARTER_DSC? I think that we can look up these values from YEAR & QUARTER TABLE. And it breaks 2NF.

2) What is the normal form that a schema in data warehouse needs to satisfy? (1NF, 2NF, 3NF, or any.)


Solution

  • NFs (normal forms) don't matter for data warehouse base tables.

    We normalize to reduce certain kinds of redundancy so that when we update a database we don't have to say the same thing in multiple places and so that we can't accidentally erroneously not say the same thing where it would need to be said in multiple places. That is not a problem in query results because we are not updating them. The same is true for a data warehouse's base tables. (Which are also just queries on its original database's base tables.)

    Data warehouses are usually optimized for reading speed, and that usually means some denormalization compared to the original database to avoid recomputation at the expense of space. (Notice though that sometimes rereading something bigger can be slower than reading smaller parts and recomputing the big thing.) We probably don't want to drop normalized tables when moving to a data warehouse, because they answer simple queries and we don't want to slow down by recomputing them. Other than those tradeoffs, there's no reason not to denormalize. Some particular warehouse design methods might have their own rules about what parts should be denormalized what amounts.

    (Whatever our original database design NF is chosen to be, we should always first normalize to 5NF then consciously denormalize. We don't need to normalize or know constraints to update or query a database.)

    Read some textbook basics on why we normalize & why we use data warehouses.