Search code examples
sqlt-sqldata-warehousecube

Remove duplicates from fact table to calculate measure correctly


I'm very new to data warehousing and dimensional modelling. For a uni project I started out with a database that I need to turn into a data warehouse for analyses. To end up with a clean star schema, I had to denormalize a few tables into 1 fact table. The downside to this is the amount of redundancy.

Below is a part of the data from the fact table:

enter image description here

A voyage consists of multiple shipments, and a shipment can consist of multiple different items. In this example, containers 1-2000 of shipment 1 contain item 3, and containers 2001-5000 contain item 1. The total amount of containers for this shipment is 5000, obviously. However, for data analysis purposes, I need to calculate a measure for the total amount of containers per voyage. This presents a problem with the current fact table, because I have a record for each different item. So for voyage 1, the actual total amount should be 9200, but because of the duplication I'll end up with 19400, leading to an incorrect measure.

I need to find a way to get rid of the duplicates in my calculation, but I can't find a way to do so. Any help would be much appreciated.


Solution

  • What you'll need to do is group by your shipments (CTE, inner query, temp table, etc) to get the number of containers per shipment, then group by your voyages to get the number of containers per voyage.

    Here's an example with an inner query:

    SELECT voyage_id, SUM(num_ship_containers) AS num_voyage_containers
    FROM (
      SELECT voyage_id, shipment_id, MAX(container_end) AS num_ship_containers
      FROM ShippingWarehouse
      GROUP BY voyage_id, shipment_id
    ) AS ship_data
    GROUP BY voyage_id;
    
    voyage_id num_voyage_containers
    1 9200

    Try it out!