etl data-warehouse business-intelligence

Data warehouse modelling

Let’s say I have 6 databases, multiple industries, mostly similar schema.

Currently we have dozens of excel files that connect to each DB and query live data (work orders, invoices etc).

As I understand. Creating a separate DW database would be of benefit performance wise, but also when remodelled, would eliminate complex joins currently required by our QA people.

Would it stand to reason, that I would have a denormalized table, called ‘WorkOrder’ which was also a merger of all work orders across 5-6 systems? How would I handle the primary keys of each workorder, when they overlap? I assume a distinct column for each with a unique prefix to designate the origin database?

Should the workorder table contain only common fields, or would all fields make more sense, nulling out those fields where data didn’t exist in the original?

This denormalized table would be easier to query from the QA standpoint, no doubt. But seems to contradict what I’ve read about DW star or snowflake modelling with facts etc?!?

It’s very likely I just don’t get the fundamentals of data warehousing either :)

Solution

Having decided that you need a Data Warehouse, the first decision you need to take is what type of design/database you are going to use. There are quite a few options (Kimball, Inmon, Data Vault, NoSQL, Graph, etc.) but the vast majority of data warehouses follow the basic Kimball Methodology of dimensional modelling e.g. Facts and Dimensions.

If you are going to build a Kimball-style data warehouse (or follow any other methodology) then my first recommendation would be to employ someone with experience who can lead the work. It is very easy to make mistakes when designing a DW but very hard to correct them once people are using it, have built reports against it, etc.

If you're not going to employ someone who knows what they are doing then the next best option is to go on a course and/or read books on the subject. For Kimball, there are really 2 books that should be required reading:

The Data Warehouse Lifecycle Toolkit : this talks you through all the components involved and the steps to follow in order to deliver a robust data warehouse
The Data Warehouse Toolkit : this goes through the steps to design a dimensional model

Once you have read and understood these 2 books you will be better placed to understand the terminology and ask specific, focussed questions about any parts of the methodology (or your specific circumstances) that you don't understand.

This is absolutely not meant to be a criticism but from your questions it is very clear that you don't (yet) have the knowledge or experience to be designing and building a data warehouse - and you're not going to be able gain that experience by asking questions on this (or any other) forum.