I'm building an Analytics application where we track the conversions for a companies marketing Campaigns. A conversion is if they go to a supermarket and buy the product. If the company is Heinz, they might run Campaigns for different products, so Campaigns might be:
These are online Campaigns so they can have different Mediums such as:
And if someone buys a product, it's bought via a Supermarket such as:
And we're tracking conversions for all these. The analytics application needs to display conversions data for any combination of the above. So for instance, I might need to show conversions...
To make the analytics fast, we're avoiding dealing with the raw data (millions of records), but storing an aggregated version of the data stored on a per day basis. So for the 12th September say, I can stored that we had 12 conversions for Baked Beans, 6 conversions (for all products) were made via the website, and Walmart had 8 conversions, and these could go in 3 separate tables (called Campaigns, Mediums, and Supermarkets). But if I need to know conversions for Ketchup made via the Facebook page and for Walmart, storing in separate tables obvious won't work.
I'm struggling to come up with a data model that can support the above. I'm using a standard Relational database (MySQL). Perhaps there is a better strategy to handle this.
Yes there is a better strategy. It's called Dimensional Modeling, or Star Schema.
You store one table, called the Fact Table, which has columns for Campaign, Medium, and Supermarket.
For each of these attributes, the column in the Fact Table is a foreign key to a Dimension Table. One dimension table for each, Campaigns, Mediums, and Supermarkets.
In your case, to avoid millions of rows, you could add a fourth column to the fact table, conversions
. Increment the count when you get a conversion (as opposed to adding another row).
Here's an example table:
CREATE TABLE FactTable (
campaign_id INT,
medium_id INT,
supermarket_id INT,
conversions INT,
PRIMARY KEY (campaign_id, medium_id, supermarket_id),
FOREIGN KEY (campaign_id) REFERENCES Campaigns(campaign_id),
FOREIGN KEY (medium_id) REFERENCES Mediums(medium_id),
FOREIGN KEY (supermarket_id) REFERENCES Supermarkets(supermarket_id)
);
Then you can query for all conversions:
for Baked Beans.
SELECT SUM(conversions) FROM FactTable
JOIN Campaigns USING (campaign_id)
WHERE campaign = 'Baked Beans';
for Baked beans from the Facebook page.
SELECT SUM(conversions) FROM FactTable
JOIN Campaigns USING (campaign_id)
JOIN Mediums USING (medium_id)
WHERE campaign = 'Baked Beans' AND medium = 'Facebook';
for the Supermarket Walmart but for all Campaigns & Mediums.
SELECT SUM(conversions) FROM FactTable
JOIN Supermarkets USING (supermarket_id)
WHERE supermarket = 'Walmart';
for Walmart made from the Facebook page but for all Campaigns.
SELECT SUM(conversions) FROM FactTable
JOIN Mediums USING (medium_id)
JOIN Supermarkets USING (supermarket_id)
WHERE medium = 'Facebook' AND supermarket = 'Walmart';
for Ketchup made via the Flash Banner Ad and for Safeway.
SELECT SUM(conversions) FROM FactTable
JOIN Campaigns USING (campaign_id)
JOIN Mediums USING (medium_id)
JOIN Supermarkets USING (supermarket_id)
WHERE campaign = 'Ketchup' AND medium = 'Flash Banner Ad' AND supermarket = 'Safeway';
Check out books by Ralph Kimball for more on Dimensional Modeling.