structure etl data-modeling dbt fishtown-analytics

DBT - best practices to create staging views for the final business model

I am working on a business model/view, whose raw SQL definition contains a very complex query on the source table.

My question here is - that I need to create multiple staging models on the source table as there are different specific select queries on the same source(which cannot be taken care of in a single staging model on the source). What should be the best practice to deal such scenario.

One way is to create all the different staging models directly on source and use them in my final business model.

Second way is to create one staging model on the source that will have all fields from the source (and no transformation required in my particular case), and then use this staging model to create all the other intermediate staging models with the specific sql query.

Let me know if there is any other better way for this.

Solution

I appreciate the thought you’re putting into it! My answer is really more opinion than anything.

I think you should follow the practice of one staging model per source. Staging models are just for translating and light cleaning. They most come in handy when the source data changes. In that scenario, you’ll only need to change the staging model and not have to touch any downstream models.

So one staging model for your source in question, and (obviously) one final table for the end-users. As for what’s in b/w, if whatever you like! You can choose to have:

your 3 SELECTs be all views selecting from your staging model, and the final model selections from that view.
just one table model built directly off the staging model, but with your 3 SELECTs be CTEs in the final model query, or,
your 3 SELECTs be all ephemeral models (if your database adapter supports them) selecting from your staging model, and the final model selections from those models. This is sort-of a synthesis of the above two approaches.

There’s also many methods to hide your intermediary models as well from your end user.

You just make whatever makes sense for you and your users across the following dimensions:

observability (what’s the easiest to way to read and understand the logic you’ve implemented?)
performance (what’s fastest for you end user and for your development)
long-term maintenance (what will give you the flexibility to quickly make adjustments and improvements as your project grows?)