Search code examples
mysqldatabasedata-warehousestar-schema

Creating a "DWH like" Star Schema in MySQL


I am pretty new to databases and SQL in general and have a pretty important task that I do not know how to solve.

Given that I have a Company (Company A) with three different subsidiaries (Sub A, Sub B, Sub C). All three subsidiaries produce the same products which can be divided into three product categories:

  • Product Category A
  • Product Category B
  • Product Category C

My task is to create a star scheme that allows me to get the following information: Total Revenues of Sub A and Sub B for every Product from Product Category B.

The needed format of the result is a table containing three columns "Subsidiary Name", "Product Name" and "Revenue".

What I have been trying to do so far, is to create the following Fact Table, Dimension and Keys. The Primary Keys of the Dimensions are connected to the Foreign Keys of my Fact Table.

Structure Approach

So my questions are:

  • Am I even on the right track or is this approach completely wrong?

  • How would I actually "fill" my tables with example data to be able to write a query that answers the given question above?

  • How would such a query look like?

For now, I am using a local MySQL server created with XAMPP and am connecting to this using IntelliJ IDEA Ultimate 2016. The diagramm above has also bee created using this software.

I hope that someone is able and willing to help me.

Also, please forgive any mistakes (or please tell me what I did wrong) as I am both new to SQL and especially new to Stackoverflow.

Kind regards

Edit:

I have used a different tool to create and visualize my tables. Also, as fenix mentioned, I have added several different Dimensions such as a Time/Date Table. However, in order to solve my task, I suppose I only need the two Dimensions dimCompany (Which represents the different subsidiaries) and dimProduct. enter image description here


Solution

    • You are on the track to it proper. School example of how data mart should be structured. Real-life example will possibly contain native keys of companyID, productID and salesID. Please add PK_sales or PK_fact (whatever you prefer). Event (one sale) without date or time-stamp is rarely to happen but if this is just some proof of concept, maybe you can go without it.
    • Do you have some data or you want to made up some data?
      1. You can create some example data inside excel using RANDBETWEENfunction.
      2. Many MySQL client tools have option to import data from excel, csv or txt files.
      3. Or you can use some of free data integration tools like Pentaho Kettle and Talend Open Studio. If it is too much at this moment avoid this step. However, it will give you good insight what DI tools are and how DWH could be populated (in real-life example).
    • It would be something like this:

    SELECT dp.ProductName,sum(fs.revenue) as total_revenue FROM fact_sales fs JOIN dim_product dp ON fs.FK_Product = dp.PK_Product JOIN dim_company dc ON fs.FK_Company = dc.PK_Company WHERE dp.Product_Type = 'Product Category B' AND dc.CompanyName IN ('Sub A','Sub B') GROUP BY dp.PK_Product;

    Unfortunately, some of Data Marts are not well structured so it doesn't contain referential integrity (anti-pattern), so you would have to go with LEFT JOIN instead of INNER JOIN. Note that INNER JOIN is same as JOIN. A bit about JOIN you can visualise here.

    Edit:

    Regarding your comment, it's hard to make good balance not to make this answer too broad because this QA site should serve for specific questions and answers.

    Since DWH is still a databases, it inherited most of databases principles. One of those are table keys. Primary Key is unique row identifier, and when database design allows (in DWH world it always should be) keys are integers (big integers exactly). See this for more info.

    Purpose? You can have two companies with same names but different states, so you need some key to distinct them.

    Why it should be integer? It's more technical question, and one of the answers is of query optimisation. CPU will match two integers (4 or 8 bytes) much faster rather than strings (VARCHARin database world) with 100+ bytes (or however you specify).

    Why someone uses VARCHAR as primary key? Well, sometimes, being as native as possible is good approach. So, VARCHAR primary key for product could be combination of product brand and counter (from supplier), and if you are re-seller, ERP could give Microsoft Office 2016 key like MCR-OFF-123. In DWH it is recommended to store native key (whatever it is, int or varchar) and to assign surrogate primary key.

    Sub A has the key '1', Sub B has the key '2

    quote from your comment.

    Yes, you're right.