I'm new to data warehousing so apologies if this is super basic, but I'm curious about this concept.
Example: say I have a table that stores aggregated analytics for a website, (e.g. the total pageviews for a url on a given date).
dbo.PageFacts
Date | Url_Id | Page_Views |
---|---|---|
2020-01-01 | 1 | 280 |
2020-01-01 | 2 | 50 |
2020-01-02 | 3 | 10 |
Now let's say I'm tasked with adding a new "Device_Id" column to this table (foreign keyed to its own dimension table), and now the table is split out like this...
Date | Url_Id | Device_Id | Page_Views |
---|---|---|---|
2020-01-01 | 1 | 101 | 180 |
2020-01-01 | 1 | 102 | 100 |
2020-01-01 | 2 | 101 | 50 |
2020-01-02 | 3 | 101 | 10 |
How do most people go about doing this for a fact table with millions of records? I'm assuming you don't want to rebuild it every time new dimensions are added in like, especially when it requires updating the aggregated value.
I was thinking of creating a surrogate key on the fact table and then creating a separate dimension table with percent breakdowns (e.g. 0.75, 0.25, etc) and then just building a VIEW that joins them together and computes the new values? Something like this...
CREATE TABLE PageFacts (
PageSurKey INT PRIMARY KEY,
Date DATE,
Url_Id INT,
Page_Views INT
);
CREATE TABLE Device_Pct (
Id INT PRIMARY KEY,
PageSurKey INT FOREIGN KEY REFERENCES PageFacts(PageSurKey),
Device_Id INT,
Percentage FLOAT
);
CREATE VIEW Device_PageFacts AS
SELECT
pf.Date,
pf.Url_Id,
d.Device_Id,
SUM(pf.Page_Views * d.Percentage) as Page_Views
FROM PageFacts pf
JOIN Device_Pct d on d.PageSurkey = pf.PageSurkey
GROUP BY pf.Date, pf.Url_Id, d.Device_Id;
This seems like the way to go (since adding in new dimensions which further slice the data only requires knowing the ratio to split by), but I don't know if there's a better practice. Any insight would help. Thanks in advance.
Normally you will rebuild the fact table if you change its grain. That's a pretty significant design change to the dimensional model, and I don't see any benefit to complicating the schema to avoid rebuilding the fact table.
If you want, you could introduce this as an additional fact table, but it wouldn't have "percentages" just a normal fact table:
CREATE TABLE PageDeviceFacts
(
Date DATE references DimDate,
Url_Id INT references DimUrl,
Device_ID INT references DimDevice,
Page_Views INT,
constraint pk_PageDeviceFacts
primary key (Date,Url_Id,Device_ID)
);
And if it's a big table, make the PK non-clustered and create a clustered columnstore index
create clustered columnstore index cci_PageDeviceFacts on PageDeviceFacts
And the old PageFacts table becomes an aggregate fact in the model.