How can I ephemerally materialize a Slowly Changing Dimension Type 2 from a folder of daily extracts, where each CSV is one full extract of a table from a source system?
We're designing ephemeral data warehouses as data marts for end users that can be spun up and burned down without consequence. This requires we have all data in a lake/blob/bucket.
We're ripping daily full extracts because that way we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management.
Is there a solution that could give me the state as of a specific date and not just the "newest" state?
Am I thinking about this completely backwards and there's a much easier way to do this?
There's an `insert_by_period` dbt materialization in the `dbt-utils` package that I think might be exactly what I'm looking for? But I'm confused, because what I actually seem to need is `dbt snapshot`, run for each file incrementally, all at once.

I don't know much about Databricks's Delta Lake, but it seems like it should be possible with Delta Tables?
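I haven't worked through it, but the shape I'd guess at on Delta Lake is roughly the PySpark sketch below. To be clear, this is an assumption on my part, not a known recipe: the table path, the `apply_extract` helper, and using `LastModified` as the change indicator are all made up, and the dim table would need to be created up front (same columns as the extracts plus `valid_from`/`valid_to`) and seeded with the first file.

```python
# Hypothetical Delta Lake replay: close superseded rows, then append new
# versions. Assumes a SparkSession with the delta-spark package configured
# and an existing SCD-2 Delta table at the (assumed) path below.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession, functions as F

DIM_PATH = "/lake/dims/opportunity_scd2"  # assumption, not from the post
spark = SparkSession.builder.getOrCreate()

def apply_extract(path: str, extract_date: str) -> None:
    snap = spark.read.option("header", True).csv(path)
    dim = DeltaTable.forPath(spark, DIM_PATH)
    # 1) Close the open row for every key whose attributes changed; here
    #    LastModified stands in for a full column-by-column comparison.
    dim.alias("t").merge(
        snap.alias("s"),
        "t.OppId = s.OppId AND t.valid_to IS NULL AND t.LastModified <> s.LastModified",
    ).whenMatchedUpdate(set={"valid_to": F.lit(extract_date)}).execute()
    # 2) Keys with no open row now are either brand new or were just closed;
    #    append a fresh open version for each of them.
    still_open = dim.toDF().where("valid_to IS NULL").select("OppId")
    (snap.join(still_open, "OppId", "left_anti")
         .withColumn("valid_from", F.lit(extract_date))
         .withColumn("valid_to", F.lit(None).cast("string"))
         .write.format("delta").mode("append").save(DIM_PATH))
```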
Is our problem solved if we can make our extracts contain only what has changed since the previous extract?
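If we could, I think the replay collapses to almost nothing. Here's a sketch of what I mean, using the column names from the example files further down and assuming the `Extracted` column doubles as the change date:

```python
# If each file held only the rows that changed since the previous extract,
# SCD-2 is just the files stacked in order, with each version closed by the
# next version of the same key.
import glob
import pandas as pd

files = sorted(glob.glob("lake/crm/*_CRM_extract.csv"))  # assumed folder layout
rows = pd.concat((pd.read_csv(f, dtype=str) for f in files), ignore_index=True)
rows = rows.rename(columns={"Extracted": "valid_from"})
rows["valid_to"] = rows.groupby("OppId")["valid_from"].shift(-1)  # next version closes this one
```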
Suppose the following three files are in a folder in a data lake (gist with the 3 CSVs and the desired table outcome as a CSV). I added the `Extracted` column in case parsing the timestamp from the filename is too tricky.
2020-09-14_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 |
| 2 | B | 3 - Propose | | 9/12 | 9/14 |
2020-09-15_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 |
| 3 | C | 1 - Lead | | 9/14 | 9/15 |
2020-09-16_CRM_extract.csv
| OppId | CustId | Stage | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/16 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 |
Below is the SCD-II for the three files as of 9/16. The SCD-II as of 9/15 would be the same, except that `OppId=3` would have only one row, with `valid_from=9/15` and `valid_to=null`:
| OppId | CustId | Stage | Won | LastModified | valid_from | valid_to |
|-------|--------|-------------|-----|--------------|------------|----------|
| 1 | A | 2 - Qualify | | 9/1 | 9/14 | null |
| 2 | B | 3 - Propose | | 9/12 | 9/14 | 9/15 |
| 2 | B | 4 - Closed | Y | 9/14 | 9/15 | null |
| 3 | C | 1 - Lead | | 9/14 | 9/15 | 9/16 |
| 3 | C | 2 - Qualify | | 9/15 | 9/16 | null |
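To make the target transformation concrete, here is a rough pandas sketch of the replay I have in mind. The `scd2_as_of` helper and the folder layout are mine, and I'm assuming the filename's date prefix is authoritative; stopping the replay at a cutoff date is also exactly what I mean by "state as of a specific date".

```python
# Replay daily full extracts into SCD-2 (helper name and paths are assumptions).
import glob
import os
import pandas as pd

KEY = "OppId"
TRACKED = ["CustId", "Stage", "Won", "LastModified"]  # a change here opens a new version

def scd2_as_of(folder: str, as_of: str) -> pd.DataFrame:
    """Replay every extract dated up to and including `as_of` (YYYY-MM-DD)."""
    files = sorted(
        f for f in glob.glob(os.path.join(folder, "*_CRM_extract.csv"))
        if os.path.basename(f)[:10] <= as_of  # ISO date prefixes sort lexicographically
    )
    current = {}   # natural key -> currently open version
    history = []   # closed-out versions
    for path in files:
        extract_date = os.path.basename(path)[:10]
        snapshot = pd.read_csv(path, dtype=str).fillna("")
        for row in snapshot.to_dict("records"):
            key, new_vals = row[KEY], {c: row[c] for c in TRACKED}
            open_row = current.get(key)
            if open_row is not None and all(open_row[c] == new_vals[c] for c in TRACKED):
                continue  # unchanged since the previous extract
            if open_row is not None:
                open_row["valid_to"] = extract_date  # close the superseded version
                history.append(open_row)
            current[key] = {KEY: key, **new_vals, "valid_from": extract_date, "valid_to": None}
        # A full extract also reveals deletions (keys absent from the file);
        # closing those rows is left out here for brevity.
    return (pd.DataFrame(history + list(current.values()))
            .sort_values([KEY, "valid_from"], ignore_index=True))

print(scd2_as_of("lake/crm", "2020-09-16"))
```

Run against the three files above, this should produce exactly the desired table, and `scd2_as_of("lake/crm", "2020-09-15")` should give the 9/15 state described earlier.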
Interesting concept, and of course it would take a longer conversation than is possible in this forum to fully understand your business, stakeholders, data, etc. I can see that it might work if you had a relatively small volume of data, your source systems rarely changed, your reporting requirements (and hence, data marts) also rarely changed, and you only needed to spin up these data marts very infrequently.
My concerns would be:
- I'm not sure your statement that "we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management" is true. How are you going to test updates to your code without spinning up the data mart(s) and going through a standard test cycle with data? And then, how is that different from traditional DW change management?