Search code examples
sqlsnowflake-cloud-data-platformdatediffpartition

Partitioning and selecting a value in Snowflake based on distance from date


I have a billion row dataset which is constantly building with more repeat data on customers.

ID   creation_date          report_date             status
001  2021-01-20T00:22:06Z   2021-02-02T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-02T00:22:06Z    ACTIVE
003  2021-02-01T00:22:06Z   2021-02-02T00:22:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-02-02T00:23:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-02T00:23:06Z    ACTIVE
003  2021-02-01T00:22:06Z   2021-02-02T00:23:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-02-19T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-19T00:22:06Z    ACTIVE
003  2021-02-01T00:22:06Z   2021-02-19T00:22:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-02-20T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-20T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-02-20T00:22:06Z    EXPIRED
001  2021-01-20T00:22:06Z   2021-02-21T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-21T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-02-21T00:22:06Z    EXPIRED
001  2021-01-20T00:22:06Z   2021-02-30T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-02-30T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-02-30T00:22:06Z    EXPIRED
001  2021-01-20T00:22:06Z   2021-03-01T00:22:06Z    ACTIVE
002  2021-01-30T00:22:06Z   2021-03-01T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-03-01T00:22:06Z    ACTIVE
001  2021-01-20T00:22:06Z   2021-03-22T00:22:06Z    EXPIRED
002  2021-01-30T00:22:06Z   2021-03-22T00:22:06Z    EXPIRED
003  2021-02-01T00:22:06Z   2021-03-22T00:22:06Z    EXPIRED

Each report_date indicates a date where all records were updated to their current status. Like a pulse check.

And all I would like is the last status of which the user was during the week after a month from creation date (week 5).

As an example: ID = 001.

Here we see their creation date is 2021-01-20, meaning one month from this date is 2021-02-20. I would like to know:

  • What was the final status of this user during the report dates between 2021-02-20 and 2021-02-27?

You can see in the above data that Active remained active in all of the reports between 2021-02-20 and 2021-02-27 (that are listed.)

To make things simple, we only want to know the LAST change of status in this time frame. Notice in ID=003, they swapped to ACTIVE on 2021-02-22', so though they were EXPIRED` the day before, they switched back to active within the boundary.

Anything after a week after the month (anything after 5 weeks) is irrelevant.

You also may notice that 1 month from 2021-01-30 is 2021-02-30 which doesnt make sense. In these cases, use the final date of the month, or 2021-02-28.

Final output:

ID    week_5_status
001          ACTIVE   
002         EXPIRED
003          ACTIVE

Solution

  • First, convert the text values (presumably) to valid datetime values. Then, filter the rows such that report_datetime is fewer than 6 weeks after creation_datetime. Take the max of that filtered list, then join back to the original data to get the status for the row with the max value.

    CREATE TABLE t (id int, creation_date VARCHAR(19), report_date VARCHAR(19), status text);
    INSERT INTO t (id,creation_date,report_date,status) VALUES 
    (1,'2021-01-20T00:22:06','2021-02-02T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-02T00:22:06','ACTIVE'),
    (3,'2021-02-01T00:22:06','2021-02-02T00:22:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-02-02T00:23:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-02T00:23:06','ACTIVE'),
    (3,'2021-02-01T00:22:06','2021-02-02T00:23:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-02-19T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-19T00:22:06','ACTIVE'),
    (3,'2021-02-01T00:22:06','2021-02-19T00:22:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-02-20T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-20T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-02-20T00:22:06','EXPIRED'),
    (1,'2021-01-20T00:22:06','2021-02-21T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-21T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-02-21T00:22:06','EXPIRED'),
    (1,'2021-01-20T00:22:06','2021-02-30T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-02-30T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-02-30T00:22:06','EXPIRED'),
    (1,'2021-01-20T00:22:06','2021-03-01T00:22:06','ACTIVE'),
    (2,'2021-01-30T00:22:06','2021-03-01T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-03-01T00:22:06','ACTIVE'),
    (1,'2021-01-20T00:22:06','2021-03-22T00:22:06','EXPIRED'),
    (2,'2021-01-30T00:22:06','2021-03-22T00:22:06','EXPIRED'),
    (3,'2021-02-01T00:22:06','2021-03-22T00:22:06','EXPIRED');
    
    
    WITH dat
    AS
    (
    SELECT id
    , CAST(creation_date AS datetime) AS creation_datetime
    , CAST(REPLACE(report_date,'02-30','02-28') AS datetime) AS report_datetime
    , status
    FROM t
    ),
    dat2
    AS
    (
    SELECT id
    ,MAX(report_datetime) AS max_report_datetime
    FROM dat
    WHERE DATEDIFF(week,creation_datetime,report_datetime) < 6
    GROUP BY id
    )
    SELECT dat.*
    FROM dat
         INNER JOIN dat2 
                 ON dat.id = dat2.id 
                AND dat.report_datetime = dat2.max_report_datetime;
    

    dbfiddle.uk