Tags: amazon-redshift, data-warehouse, dbt

Working with multiple data warehouses in dbt


I'm building an application where each of our clients needs their own data warehouse (for security, compliance, and maintainability reasons). For each client we pull in data from multiple third-party integrations and merge them into a unified view, which we use to run analytics and report metrics across those integrations. These transformations and all relevant schemas are the same for all clients. We would need this to scale to 1000s of clients.

From what I gather, dbt is designed so that each project corresponds to one warehouse. I see two options:

  1. Use one project and create a separate target for each client (and maybe a single dev environment). Given that targets aren't designed for this, are there any catches? Will scheduling, orchestration, or querying the outputs become painful or unscalable for some reason?

profiles.yml:

example_project:
  target: dev
  outputs:
    dev:
      type: redshift
      ...
    client_1:
      type: redshift
      ...
    client_2:
      type: redshift
      ...
    ...
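
Each client run would then select its target explicitly, e.g. dbt run --target client_1, so scheduling amounts to running the same command with a different target name per client.
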
  2. Create multiple projects and a shared dbt package containing most of the logic (which each client project would pull in via packages.yml, as sketched below). This seems very unwieldy, since it means maintaining a separate repo for each client, and less developer-friendly.

profiles.yml:

client_1_project:
  target: dev
  outputs:
    client_1:
      type: redshift
      ...
client_2_project:
  target: dev
  outputs:
    client_2:
      type: redshift
      ...
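
For reference, each client project would pull the shared logic in through its packages.yml; the repo URL and version here are just placeholders:

packages.yml:

packages:
  - git: "https://github.com/our-org/shared-dbt-logic.git"
    revision: 0.1.0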

Thoughts?


Solution

  • I think you captured both options.

    If you have a single database connection, and your client data is logically separated in that connection, I would definitely pick #2 (one package, many client projects) over #1. Some reasons:

    1. Selecting data from a different source (within a single connection) depending on the target is a bit hacky and wouldn't scale well to 1000s of clients.
    2. The developer experience for packages isn't so bad. You will want a developer data source, but depending on your business you could maybe get away with using one client's data (or an anonymized version of it). It will be good to keep this developer environment logically separate from any individual client's implementation, and packages allow you to do that.
    3. I would consider generating the client projects programmatically, probably with a Python CLI that sets up the required files for each client project, invokes dbt run, and tears the files down afterwards (I'm assuming you're not going to use dbt Cloud and instead have another orchestrator or compute environment that you control). It's easy to write YAML from Python with pyyaml (each file is just a dict), and your individual projects probably only need separate profiles.yml, sources.yml, and (maybe) dbt_project.yml files. I wouldn't check these generated files for each client into source control -- just check in the script and generate the files you need with each invocation of dbt. A minimal sketch follows this list.
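
    As a rough illustration (the Redshift connection fields and the shape of creds are placeholders, not anything dbt-specific), the per-client wrapper might look like:

    import subprocess
    import tempfile
    from pathlib import Path

    import yaml  # pyyaml

    def run_client(client_id: str, creds: dict) -> None:
        """Write a throwaway profiles.yml for one client, run dbt against it, then discard it."""
        profile = {
            "example_project": {
                "target": client_id,
                "outputs": {
                    client_id: {
                        "type": "redshift",
                        "host": creds["host"],
                        "user": creds["user"],
                        "password": creds["password"],
                        "dbname": creds["dbname"],
                        "schema": "analytics",
                        "port": 5439,
                    }
                },
            }
        }
        with tempfile.TemporaryDirectory() as tmp:
            Path(tmp, "profiles.yml").write_text(yaml.safe_dump(profile))
            # --profiles-dir points dbt at the generated file; it disappears when the context exits.
            subprocess.run(["dbt", "run", "--profiles-dir", tmp], check=True)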

    On the other hand, if your clients each have their own physical database with separate connections and credentials, and those databases are absolutely identical, you could get away with #1 (one project, many targets). The "hardest" parts of that approach would likely be managing secrets and generating/maintaining the list of targets to iterate over, ideally in parallel -- a rough sketch below.
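
    A minimal sketch of that iteration, assuming all client targets live in a single profiles.yml as in option #1 (the target names and worker count are placeholders):

    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    # In practice, generate this list from wherever you track clients.
    targets = ["client_1", "client_2"]

    def run_target(target: str) -> int:
        # Each invocation selects a different output from the shared profiles.yml.
        return subprocess.run(["dbt", "run", "--target", target]).returncode

    with ThreadPoolExecutor(max_workers=4) as pool:
        codes = list(pool.map(run_target, targets))

    failed = [t for t, c in zip(targets, codes) if c != 0]
    if failed:
        raise SystemExit(f"dbt run failed for: {', '.join(failed)}")

    One caveat: concurrent dbt invocations in the same project directory can contend for the compiled target/ artifacts, so you may want a separate working directory (or --target-path) per run.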