I am learning DBT, specifically dbt-mysql. I am having trouble combining several tables into one table. What I want to do: Group By several columns by the last_updated (timestamp) date of the table and then combine those columns into a single table by the split last_updated field. Here is how I want my data to end up:
Here is my staging model (which I think should be straight selects from the database):
staging/clients/stg_clients_fields.sql
SELECT id, created, last_updated, service, order_count, spent_count, deleted, country
FROM client_database.clients
Then I have intermediate models (which I think should reconstruct data for my needs): intermediate/clients/clients_last_updated_grouped.sql
SELECT YEAR(last_updated) as year_updated, MONTH(last_updated) as month_updated, COUNT(id) as client_count
FROM {{ ref('stg_clients_fields') }}
GROUP BY YEAR(last_updated), MONTH (last_updated)
intermediate/clients/clients_deleted_grouped.sql
SELECT YEAR(last_updated) as year_updated, MONTH(last_updated) as month_updated, COUNT(id) as deleted
FROM {{ ref('stg_clients_fields') }}
WHERE deleted = 1
GROUP BY YEAR(last_updated), MONTH (last_updated)
intermediate/clients/clients_service_grouped.sql
SELECT YEAR(last_updated) as year_updated, MONTH(last_updated) as month_updated, COUNT(id) as service
FROM {{ ref('stg_clients_fields') }}
WHERE service IS NOT NULL
GROUP BY YEAR(last_updated), MONTH (last_updated)
And other columns follow the same pattern based on their WHERE clauses.
Now I need to create a marts model that would use all previously created data and put it in one single table.
At this point, I end up with several tables that have the last_updated field separated and the specific column value next to the date.
How can I now combine all these tables that they would join on the last_updated split into to columns field?
Or perhaps there is a better solution to group data by year and month and get individual column values based on conditions?
I am new to DBT so all the help and all advice are welcome!
since clients_last_updated_grouped
doesn't have a where
condition, it's guaranteed to have all of the year/month combinations found in the other models. This makes it much easier. You can just select from that model and join the other models on year and month:
with
updated as (select * from {{ ref('clients_last_updated_grouped') }} ),
deleted as (select * from ),
service as (select * from ),
joined as (
select
updated.year,
updated.month,
updated.client_count,
coalesce(deleted.deleted, 0) as deleted_count,
coalesce(service.service, 0) as service_count
from
updated
left join deleted on updated.year = deleted.year and updated.month = deleted.month
left join service on updated.year = service.year and updated.month = service.month
)
select *
from joined
If your database doesn't support CTEs (with ...
), this becomes:
select
updated.year,
updated.month,
updated.client_count,
coalesce(deleted.deleted, 0) as deleted_count,
coalesce(service.service, 0) as service_count
from
{{ ref('clients_last_updated_grouped') }} as updated
left join {{ ref('clients_deleted_grouped') }} as deleted on updated.year = deleted.year and updated.month = deleted.month
left join {{ ref('clients_service_grouped') }} as service on updated.year = service.year and updated.month = service.month
If it's not the case that clients_last_updated_grouped
has every month/year combination of the other tables, you would need to first construct a "date spine", and then left join all 3 tables to that date spine.