I am new to DBT and I am facing a design challenge. The thing is I have 20+ data integrations. Each integration is feeding data into one Postgres DB. The Postgres DB has several tables for example integration_1_assets
, integration_2_assets
. These tables are all for assets but have different datasets (different column names etc).
Now I would like to create a DBT project to pick data for each integration then transform these into one final dim_assets
table. They may be one or more data transformations before coming final insert into dim_assets. Business requires that we run the transformations for each integration differently i.e. integration_1 should have its own transformation pipeline independent of others.
Should I create a DBT project for each integration pipeline or I can use one?
If you have one postgres database with many tables (one per integration), then I think you should have one DBT project.
The docs describe how to set up a postgres connection in profiles.yml
You'll want to configure some sources something like as follows (we'll call this sources.yml
):
version: 2
sources:
- name: assets
description: "Tables containing assets from various integrations"
database: your_database
schema: your_schema
tables:
- name: integration_1
identifier: integration_1_assets
description: "Assets from integration 1"
- name: integration_2
...
Following best practices for project structure I would suggest you create a set of 'staging' models that read directly from the source models and execute basic cleaning/transformation:
-- stg_assets_1.sql
SELECT
col1 AS standard_column_name_1,
...
colN AS standard_column_name_N
FROM
{{ source('assets', 'integration_1') }}
WHERE
data_is_good = true
... then via a series of intermediate steps combine the many staging models into a single dim_assets
model.
The project layout would look something like this
models
├── sources
│ └── sources.yml
├── staging
│ ├── stg_models.yml
│ ├── stg_assets_1.sql
...
│ └── stg_assets_N.sql
├── intermediate
│ ├── int_models.yml
│ ├── int_assets_combined_1.sql
...
│ └── int_assets_combined_M.sql
└── final
├── final_models.yml
└── dim_assets.sql
Here the intermediate/final models (or whatever you prefer to call them) would reference the earlier models using {{ ref('stg_assets_1') }}
etc.
The other YAML files are your model files, allowing you to document (and test) the models defined in each subdirectory.
Things such as materialisation strategy can be defined in the top level dbt_project.yml
(e.g. intermediate models could be ephemeral or views, while your final dim_assets
model could be materialised as a table).