I'm in the process of designing data warehouse structure and ETL process.
I've identified 'purchase' fact table, which would hold all the purchases done by the users.
Problem is, that there are multiple things in the process that user has to do:
It's important to show on the dashboard number of users who subscribed, who registered, who purchased something and conversion rates.
I've created following fact table (there are other things there, I skipped them to be more clear):
subscription_date
account_id
purchase_date
My idea was, that if someone subscribed, but didn't register, only subscription_date will be filled-in, rest of those will be null. If he subscribed and registered - subscription_date and account_id is filled in. If he did whole process, all three fields will be filled.
The problem is, that user can subscribe on day 1, register on day 2 and purchase on day 3. I'm using AWS glue to create ETL process. Everything works fine - it's fetching records from multiple database tables (subscription, account, purchase), combine them together and insert into purchase table. I do not know how to handle updates though.
What happens is if user subscribes on day 1, ETL job runs and inserts this record. If on day 2 user registers and on day 3 makes a purchase, this row is not updated. No new records are added there either.
I can think of two solutions for this problem:
Split fact table into three different fact tables and join them when populating the dashboard (I'm using quicksight for that). Then no updates would be needed. If new purchase for account is inserted in source database, it will be just inserted with proper account_id.
Somehow try add to AWS Glue Job check if this account_id for purchase already exists and update the record
Delete all records from fact table that are updated and re-insert them with all data.
The easiest solution would be to use Glue to land your data in staging tables (insert only) and then use SQL/Stored Procedures to upsert that data to the target tables.
BTW dimensional keys in your fact tables should never be null. So if a customer has not yet registered then the fact table registered date column should reference the “unknown date” row in your date dimension