Multiple datasets (sources) in the same DBT project

I am new to DBT and I am facing a design challenge. The thing is I have 20+ data integrations. Each integration is feeding data into one Postgres DB. The Postgres DB has several tables for example integration_1_assets, integration_2_assets. These tables are all for assets but have different datasets (different column names etc).

Now I would like to create a DBT project to pick data for each integration then transform these into one final dim_assets table. They may be one or more data transformations before coming final insert into dim_assets. Business requires that we run the transformations for each integration differently i.e. integration_1 should have its own transformation pipeline independent of others.

Should I create a DBT project for each integration pipeline or I can use one?

Solution

If you have one postgres database with many tables (one per integration), then I think you should have one DBT project.

The docs describe how to set up a postgres connection in profiles.yml

You'll want to configure some sources something like as follows (we'll call this sources.yml):

version: 2

sources:

  - name: assets
    description: "Tables containing assets from various integrations"
    database: your_database
    schema: your_schema
    tables:
      - name: integration_1
        identifier: integration_1_assets
        description: "Assets from integration 1"
      - name: integration_2
...

Following best practices for project structure I would suggest you create a set of 'staging' models that read directly from the source models and execute basic cleaning/transformation:

-- stg_assets_1.sql

    SELECT
        col1 AS standard_column_name_1,
        ...
        colN AS standard_column_name_N
      FROM
        {{ source('assets', 'integration_1') }}
     WHERE
        data_is_good = true

... then via a series of intermediate steps combine the many staging models into a single dim_assets model.

The project layout would look something like this

models
├── sources
│   └── sources.yml
├── staging
│   ├── stg_models.yml
│   ├── stg_assets_1.sql
...
│   └── stg_assets_N.sql
├── intermediate
│   ├── int_models.yml
│   ├── int_assets_combined_1.sql
...
│   └── int_assets_combined_M.sql
└── final
    ├── final_models.yml
    └── dim_assets.sql

Here the intermediate/final models (or whatever you prefer to call them) would reference the earlier models using {{ ref('stg_assets_1') }} etc.

The other YAML files are your model files, allowing you to document (and test) the models defined in each subdirectory.

Things such as materialisation strategy can be defined in the top level dbt_project.yml (e.g. intermediate models could be ephemeral or views, while your final dim_assets model could be materialised as a table).