Search code examples
sqlamazon-redshiftnested-loopsdate-range

Date Range Nested Loop Optimization


Redshift db.

Table A is a date/calendar table.

Table B is a member table. Table B structured as a slowly changing dimension type 6. It has nearly 200 M records.

The goal is to write a performant performant query that gives the count of members for every day in the last 4 years. My first attempted resulting in a query like so:

select
      date,
      location,
      sub_location,
      race,
      gender,
      dob,
      member_type,
      count(distinct member_id)

from date_table d

join member_table m
 on m.row_start <=  d.full_date
 and m.row_end >= d.full_date
 and m.is_active = 'Y'
 and m.row_end >= '2019-01-01'

where d.date_key >= 20190101
 and d.date_key <= to_char(current_date, 'yyyymmdd')

group by 
      date,
      location,
      sub_location,
      race,
      gender,
      dob,
      member_type
      

The performance on this is god awful because of the join being a nested loop. I've been trying to think of a way to rework this to avoid that issue but have not had any success. Curious if there is a way to do so that would increase performance significantly.

For reference here are the table designs as well as the explain plan:

create table date_table
(
    date_key                       integer not null encode delta
        primary key,
    full_date                      date encode delta,
)
    diststyle all
    sortkey (date_key);

create table member_tabnle
(
    member_key                 bigint not null
        primary key,
    member_id                  integer,
    location                   integer distkey,
    sub_location               integer encode zstd,
    gender                     varchar(50) encode zstd,
    race                       varchar(100) encode zstd,
    date_of_birth              date encode delta32k,
    member_type                char(10) encode zstd,
    active                     char encode zstd,
    row_start                  timestamp encode zstd,
    row_end                    timestamp encode zstd,
)
    diststyle key
    interleaved sortkey (location, member_id);

execution plan

I've rewritten the query in various ways, none of which meaningfully impacted performance.

The output should be

Date, member attributes, count of records

Solution

  • You're in luck as I solved this exact issue a few years back and wrote up a description on the solution. You can find it here - http://wad-design.s3-website-us-east-1.amazonaws.com/sql_limits_wp.html

    The basic issue you are facing is the need to massively grow the data before you can condense it down. These fat-in-the-middle queries can be expensive on Redshift and often spill to disk making them even more expensive. The solution is to not create a row for each account for each date but rather to look at it as counting account starts by date and account ends by date - the active accounts is the difference between the rolling sums of these values. I was able to take a clients query run time down from 45 minutes to 17 seconds using this approach.

    If the approach isn't clear let me know in a comment and I can help apply this approach to your situation. It can trip people up the first time.

    This approach can be used to solve other problems efficiently like joining on the nearest date.