Tags: snowflake-cloud-data-platform, data-warehouse, external-tables, delta-lake, dbt

ad-hoc slowly-changing dimensions materialization from external table of timestamped csvs in a data lake


Question

main question

How can I ephemerally materialize a slowly changing dimension (type 2) from a folder of daily extracts, where each csv is one full extract of a table from a source system?

rationale

We're designing ephemeral data warehouses as data marts for end users that can be spun up and burned down without consequence. This requires we have all data in a lake/blob/bucket.

We're ripping daily full extracts because:

  1. we couldn't reliably extract just the changeset (for reasons out of our control), and
  2. we'd like to maintain a data lake with the "rawest" possible data.

challenge question

Is there a solution that could give me the state as of a specific date and not just the "newest" state?
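For what it's worth, once you have SCD-II rows with valid_from/valid_to, an "as of" question is just a range filter. A minimal Python sketch of that predicate (the as_of helper is an illustrative name, not from any library; valid_to of None marks the open/current version), using the OppId=2 rows from the example below:

```python
from datetime import date

# SCD-II history for OppId=2 from the example: valid_to=None means still current.
scd2 = [
    {"OppId": 2, "Stage": "3 - Propose", "valid_from": date(2020, 9, 14), "valid_to": date(2020, 9, 15)},
    {"OppId": 2, "Stage": "4 - Closed",  "valid_from": date(2020, 9, 15), "valid_to": None},
]

def as_of(rows, when):
    """Return the version of each row that was in effect on `when`:
    valid_from <= when < valid_to (an open valid_to matches any later date)."""
    return [r for r in rows
            if r["valid_from"] <= when and (r["valid_to"] is None or when < r["valid_to"])]
```

The same half-open-interval predicate works as a SQL WHERE clause against the materialized dimension, so "newest state" is just the special case `valid_to IS NULL`.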

existential question

Am I thinking about this completely backwards and there's a much easier way to do this?

Possible Approaches

custom dbt materialization

There's an insert_by_period dbt materialization in the dbt-utils package that I think might be exactly what I'm looking for. But I'm confused, because what I want is essentially dbt snapshot, but:

  1. run dbt snapshot for each file incrementally, all at once; and,
  2. built directly off of an external table?

Delta Lake

I don't know much about Databricks's Delta Lake, but it seems like it should be possible with Delta Tables?

Fix the extraction job

Is our problem solved if we can make our extracts contain only what has changed since the previous extract?
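Even without fixing the job, a changeset can be computed after the fact by diffing consecutive full extracts on the business key. A rough Python sketch (changeset is an illustrative name; it assumes load-metadata columns like Extracted are excluded from the comparison, and it does not detect hard deletes):

```python
def changeset(prev, curr, key="OppId", ignore=("Extracted",)):
    """Return rows in `curr` that are new or whose attributes differ from
    the same-keyed row in `prev`. `prev`/`curr` are lists of row dicts."""
    strip = lambda r: {k: v for k, v in r.items() if k not in ignore}
    prev_by_key = {r[key]: strip(r) for r in prev}
    return [r for r in curr if prev_by_key.get(r[key]) != strip(r)]

# Abridged rows from the 9/14 and 9/15 example files below.
day14 = [
    {"OppId": 1, "Stage": "2 - Qualify", "Extracted": "9/14"},
    {"OppId": 2, "Stage": "3 - Propose", "Extracted": "9/14"},
]
day15 = [
    {"OppId": 1, "Stage": "2 - Qualify", "Extracted": "9/15"},
    {"OppId": 2, "Stage": "4 - Closed",  "Extracted": "9/15"},
    {"OppId": 3, "Stage": "1 - Lead",    "Extracted": "9/15"},
]
```

On the example data this yields only OppId 2 (stage changed) and OppId 3 (new), which is exactly the changeset the extraction job can't produce.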

Example

Suppose the following three files are in a folder of a data lake. (Gist with the 3 csvs and desired table outcome as csv). I added the Extracted column in case parsing the timestamp from the filename is too tricky.

2020-09-14_CRM_extract.csv

| OppId | CustId | Stage       | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1     | A      | 2 - Qualify |     | 9/1          | 9/14      |
| 2     | B      | 3 - Propose |     | 9/12         | 9/14      |

2020-09-15_CRM_extract.csv

| OppId | CustId | Stage       | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1     | A      | 2 - Qualify |     | 9/1          | 9/15      |
| 2     | B      | 4 - Closed  | Y   | 9/14         | 9/15      |
| 3     | C      | 1 - Lead    |     | 9/14         | 9/15      |

2020-09-16_CRM_extract.csv

| OppId | CustId | Stage       | Won | LastModified | Extracted |
|-------|--------|-------------|-----|--------------|-----------|
| 1     | A      | 2 - Qualify |     | 9/1          | 9/16      |
| 2     | B      | 4 - Closed  | Y   | 9/14         | 9/16      |
| 3     | C      | 2 - Qualify |     | 9/15         | 9/16      |

End Result

Below is the SCD-II result for the three files as of 9/16. The SCD-II as of 9/15 would be the same, except OppId=3 would have only one row, with valid_from=9/15 and valid_to=null.

| OppId | CustId | Stage       | Won | LastModified | valid_from | valid_to |
|-------|--------|-------------|-----|--------------|------------|----------|
| 1     | A      | 2 - Qualify |     | 9/1          | 9/14       | null     |
| 2     | B      | 3 - Propose |     | 9/12         | 9/14       | 9/15     |
| 2     | B      | 4 - Closed  | Y   | 9/14         | 9/15       | null     |
| 3     | C      | 1 - Lead    |     | 9/14         | 9/15       | 9/16     |
| 3     | C      | 2 - Qualify |     | 9/15         | 9/16       | null     |
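The End Result above can be derived by stacking the daily extracts in date order and closing a key's open version whenever its attributes change between consecutive days. A minimal pure-Python sketch of that logic (build_scd2 is an illustrative name, not a dbt or Snowflake feature; it assumes each file is a complete extract, ignores the Extracted load-metadata column, and does not handle hard deletes):

```python
from datetime import date

def build_scd2(extracts, key="OppId", ignore=("Extracted",)):
    """extracts: dict mapping extract date -> list of row dicts (one full extract).
    Returns SCD-II rows with valid_from/valid_to (None = still current)."""
    strip = lambda r: {k: v for k, v in r.items() if k not in ignore}
    history, current = [], {}          # current: key -> index of its open row
    for day in sorted(extracts):
        for row in extracts[day]:
            attrs = strip(row)
            open_idx = current.get(row[key])
            if open_idx is not None:
                open_row = history[open_idx]
                if {k: open_row[k] for k in attrs} == attrs:
                    continue                   # unchanged: keep the open version
                open_row["valid_to"] = day     # changed: close the old version
            history.append({**attrs, "valid_from": day, "valid_to": None})
            current[row[key]] = len(history) - 1
    return history

# The three example files, abridged to the key and one tracked attribute.
sample = {
    date(2020, 9, 14): [{"OppId": 1, "Stage": "2 - Qualify", "Extracted": "9/14"},
                        {"OppId": 2, "Stage": "3 - Propose", "Extracted": "9/14"}],
    date(2020, 9, 15): [{"OppId": 1, "Stage": "2 - Qualify", "Extracted": "9/15"},
                        {"OppId": 2, "Stage": "4 - Closed",  "Extracted": "9/15"},
                        {"OppId": 3, "Stage": "1 - Lead",    "Extracted": "9/15"}],
    date(2020, 9, 16): [{"OppId": 1, "Stage": "2 - Qualify", "Extracted": "9/16"},
                        {"OppId": 2, "Stage": "4 - Closed",  "Extracted": "9/16"},
                        {"OppId": 3, "Stage": "2 - Qualify", "Extracted": "9/16"}],
}
```

Run on the sample data this produces the same five versions as the End Result table. In SQL the equivalent is a window-function dedup (LAG over the stacked extracts, keeping rows where the attributes changed, then LEAD of the extract date as valid_to), which is a reasonable shape for a dbt model over the external table.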

Solution

  • Interesting concept. Of course, it would be a longer conversation than is possible in this forum to fully understand your business, stakeholders, data, etc. I can see that it might work if you had a relatively small volume of data, your source systems rarely changed, your reporting requirements (and hence, datamarts) also rarely changed, and you only needed to spin up these datamarts very infrequently.

    My concerns would be:

    1. If your source or target requirements change, how are you going to handle this? You will need to spin up your datamart, do full regression testing on it, apply your changes, and then test them. If you do this as/when the changes are known then it's a lot of effort for a datamart that's not being used - especially if you need to do this multiple times between uses; if you do this when the datamart is needed then you're not meeting your objective of having the datamart available for "instant" use.

       I'm also not sure your statement "we have a DW as code that can be deleted, updated, and recreated without the complexity that goes along with traditional DW change management" is true. How are you going to test updates to your code without spinning up the datamart(s) and going through a standard test cycle with data - and then how is this different from traditional DW change management?

    2. What happens if there is corrupt/unexpected data in your source systems? In a "normal" DW where you are loading data daily this would normally be noticed and fixed on the day. In your solution the dodgy data might have occurred days/weeks ago and, assuming it loaded into your datamart rather than erroring on load, you would need processes in place to spot it and then potentially have to unravel days of SCD records to fix the problem.

    3. (Only relevant if you have a significant volume of data.) Given the low cost of storage, I'm not sure I see the benefit of spinning up a datamart when needed as opposed to just holding the data so it's ready for use. Loading large volumes of data every time you spin up a datamart is going to be time-consuming and expensive. A possible hybrid approach might be to only run incremental loads when the datamart is needed rather than running them every day - so you have the data from when the datamart was last used ready to go at all times, and you just add the records created/updated since the last load.