Redshift db.
Table A is a date/calendar table.
Table B is a member table, structured as a slowly changing dimension (SCD) type 6. It has nearly 200M records.
The goal is to write a performant query that gives the count of members for every day in the last 4 years. My first attempt resulted in a query like so:
select
    d.full_date as date,
    m.location,
    m.sub_location,
    m.race,
    m.gender,
    m.date_of_birth as dob,
    m.member_type,
    count(distinct m.member_id)
from date_table d
join member_table m
    on m.row_start <= d.full_date
    and m.row_end >= d.full_date
    and m.active = 'Y'
    and m.row_end >= '2019-01-01'
where d.date_key >= 20190101
    and d.date_key <= to_char(current_date, 'yyyymmdd')::int
group by
    d.full_date,
    m.location,
    m.sub_location,
    m.race,
    m.gender,
    m.date_of_birth,
    m.member_type
The performance on this is god-awful because the range join is executed as a nested loop. I've been trying to think of a way to rework the query to avoid that, but have not had any success. Is there a way to restructure it that would improve performance significantly?
For reference here are the table designs as well as the explain plan:
create table date_table
(
    date_key integer not null encode delta
        primary key,
    full_date date encode delta
)
diststyle all
sortkey (date_key);
create table member_table
(
    member_key bigint not null
        primary key,
    member_id integer,
    location integer distkey,
    sub_location integer encode zstd,
    gender varchar(50) encode zstd,
    race varchar(100) encode zstd,
    date_of_birth date encode delta32k,
    member_type char(10) encode zstd,
    active char encode zstd,
    row_start timestamp encode zstd,
    row_end timestamp encode zstd
)
diststyle key
interleaved sortkey (location, member_id);
I've rewritten the query in various ways, none of which meaningfully impacted performance.
The output should be: date, member attributes, and the count of records.
You're in luck, as I solved this exact issue a few years back and wrote up a description of the solution. You can find it here - http://wad-design.s3-website-us-east-1.amazonaws.com/sql_limits_wp.html
The basic issue you are facing is the need to massively grow the data before you can condense it back down. These fat-in-the-middle queries can be expensive on Redshift, and they often spill to disk, making them even more expensive. The solution is to not create a row for each account for each date, but rather to look at it as counting account starts by date and account ends by date - the number of active accounts on any day is the difference between the rolling sums of those two values. I was able to take a client's query run time down from 45 minutes to 17 seconds using this approach.
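To make that concrete, here is a rough sketch of the rewrite against the tables in your question. Grouping attributes are omitted for brevity, and the clamping of early starts with `greatest` is my own addition to fold pre-window starts into an opening balance - treat this as a starting point, not a drop-in replacement:

```sql
-- Each member-table row contributes two events: +1 on the day it starts
-- and -1 the day after it ends. The rolling sum of the daily net change
-- is the active count, with no per-member-per-day row explosion.
with events as (
    -- starts before the window are clamped to its first day, so members
    -- already active on 2019-01-01 land in the opening balance
    select greatest(row_start::date, date '2019-01-01') as event_date,
           1 as delta
    from member_table
    where active = 'Y'
      and row_end >= '2019-01-01'
    union all
    -- the member still counts on row_end itself, so -1 lands the day after;
    -- ends past the calendar simply never join and the member stays counted
    select dateadd(day, 1, row_end::date),
           -1
    from member_table
    where active = 'Y'
      and row_end >= '2019-01-01'
),
daily_net as (
    -- left join from the calendar so days with no events still get a row
    select d.full_date,
           coalesce(sum(e.delta), 0) as net_change
    from date_table d
    left join events e on e.event_date = d.full_date
    where d.date_key >= 20190101
      and d.date_key <= to_char(current_date, 'yyyymmdd')::int
    group by d.full_date
)
select full_date,
       sum(net_change) over (order by full_date
                             rows unbounded preceding) as active_members
from daily_net
order by full_date;
```

To add your grouping attributes, cross join the calendar with the distinct attribute combinations and partition the window sum by those columns. One caveat for SCD tables: this counts active rows, which matches distinct members only as long as a member's version intervals don't overlap at day granularity.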
If the approach isn't clear, let me know in a comment and I can help apply it to your situation. It can trip people up the first time.
This approach can be used to solve other problems efficiently, such as joining on the nearest date.
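As an illustration of that nearest-date variant, here is a sketch using two hypothetical tables, `prices(price_date, price)` and `orders(order_date, order_id)`. Instead of an inequality join to find the latest price on or before each order, both tables are unioned into one event stream and a window function carries the last seen price forward:

```sql
-- Merge both tables into a single date-ordered stream; sort_order makes a
-- same-day price sort ahead of a same-day order.
with events as (
    select price_date as event_date,
           0 as sort_order,
           price,
           cast(null as bigint) as order_id
    from prices
    union all
    select order_date,
           1,
           cast(null as decimal(12,2)),
           order_id
    from orders
)
select event_date as order_date,
       order_id,
       -- last non-null price at or before this row in event order
       last_value(price ignore nulls) over (
           order by event_date, sort_order
           rows unbounded preceding
       ) as nearest_price
from events
where order_id is not null;
```

The same idea applies to any "as-of" lookup: one sorted pass with a window function replaces a join whose predicate the planner can only satisfy with a nested loop.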