Search code examples
data-warehouse

Multiple common tables in a snowflake data warehouse schema?


I'm trying to make a snowflake schema of a data warehouse, but turns out there are multiple types of data that would fit the center. For example: [factSales] has [order], [employee], [customer], [shipping], etc... then all the above mentioned also have [address]. Where can I exactly fit the [address] table? I want to avoid data duplication so putting the attributes in every other table is not an option.


Solution

  • The customer's address should be captured as fields on the Customer dimension, e.g. "Street name", "Post code", "City" and "Country". Similarly, the "shipping address" is (usually) a characteristic of the customer, so it should also be a set of fields in the Customer dimension.

    The "order address" depends on its functional meaning: it could be the address of the store or some other sales channel (e.g. website) where the order was placed, in which case the address would be a field in the Store/Channel dimension.

    So, if I understood correctly, your model's structure resembles something like:

    • Facts: SalesOrder
    • Dimensions: Customer, Employee, eventually Store/Channel

    The dimensions will have fields to store the appropriate addresses, e.g. the Customer dimension could have three sets of text fields to store each address: one for the home address, work address and shipping address.

    This could also be modeled by having a separate Address dimension which contains all possible addresses that exist in the data and is referenced by all other dimensions. In this case, for example, the Customer dimension could have three foreign keys (references) to the Address dimension, instead of the actual addresses as before: one for the home address, work address and shipping address. But the former approach is probably simpler to begin with.

    There is no problem with having duplicated data in a star schema, most dimensions that have hierarchies will have duplicated/redundant data by design, because the model is intentionally denormalized: https://en.wikipedia.org/wiki/Denormalization