Tags: sql, oracle, pivot, snowflake-cloud-data-platform, scd

Pivot/Denormalize Over Effective Ranges


I'm looking to pivot a transactional data set into an SCD2 shape that captures the intervals during which each combination of values at the pivot grain was effective.

Snowflake is the actual DBMS I'm using, but I'm tagging Oracle too because their dialects are nearly identical, and I could probably adapt a solution written for any DBMS.

I have working SQL, but it's born out of trial and error, and I feel like there has to be a more elegant approach I'm missing, because what I have is ugly and computationally expensive.

(Note: the second record in the input data "expires" the first. You can assume that every day of interest occurs as an add_dts at least once.)

Input:

Original_Grain  Pivot_Grain  Pivot_Column  Pivot_Attribute  Add_Dts
OG-1            PG-1         First_Col     A                2020-01-01
OG-1            PG-1         First_Col     B                2020-01-02
OG-2            PG-1         Second_Col    A                2020-01-01
OG-3            PG-1         Third_Col     C                2020-01-02
OG-3            PG-1         Third_Col     B                2020-01-03

Output:

Pivot_Grain  First_Col  Second_Col  Third_Col  From_Dt     To_Dt
PG-1         A          A           NULL       2020-01-01  2020-01-02
PG-1         B          A           C          2020-01-02  2020-01-03
PG-1         B          A           B          2020-01-03  9999-01-01

My current working query:
WITH INPUT AS 
  ( SELECT 'OG-1' AS Original_Grain,
           'PG-1' AS Pivot_Grain,
           'First_Col' AS Pivot_Column,
           'A' AS Pivot_Attribute,
           TO_DATE('2020-01-01','YYYY-MM-DD') AS Add_Dts
      FROM dual
     UNION
   SELECT 'OG-1' AS Original_Grain,
           'PG-1' AS Pivot_Grain,
           'First_Col' AS Pivot_Column,
           'B' AS Pivot_Attribute,
           TO_DATE('2020-01-02','YYYY-MM-DD')
      FROM dual
     UNION
   SELECT 'OG-2' AS Original_Grain,
           'PG-1' AS Pivot_Grain,
           'Second_Col' AS Pivot_Column,
           'A' AS Pivot_Attribute,
           TO_DATE('2020-01-01','YYYY-MM-DD')
      FROM dual
     UNION
   SELECT 'OG-3' AS Original_Grain,
           'PG-1' AS Pivot_Grain,
           'Third_Col' AS Pivot_Column,
           'C' AS Pivot_Attribute,
           TO_DATE('2020-01-02','YYYY-MM-DD')
      FROM dual
     UNION
   SELECT 'OG-3' AS Original_Grain,
           'PG-1' AS Pivot_Grain,
           'Third_Col' AS Pivot_Column,
           'B' AS Pivot_Attribute,
           TO_DATE('2020-01-03','YYYY-MM-DD')
      FROM dual     
  ),
  GET_NORMALIZED_RANGES AS 
  ( SELECT I.*,
           COALESCE(
             LEAD(Add_Dts) OVER (
             PARTITION BY I.Original_Grain
                  ORDER BY I.Add_Dts), TO_DATE('9000-01-01','YYYY-MM-DD')
             ) AS Next_Add_Dts
      FROM INPUT I
  ),
  GET_DISTINCT_ADD_DATES AS 
  ( SELECT DISTINCT Add_Dts AS Driving_Date
      FROM Input  
  ),
  NORMALIZED_EFFECTIVE_AT_EACH_POINT AS 
  ( SELECT GNR.*,
           GDAD.Driving_Date
      FROM GET_NORMALIZED_RANGES GNR
     INNER
      JOIN GET_DISTINCT_ADD_DATES GDAD
        ON GDAD.driving_date >= GNR.add_dts
       AND GDAD.driving_Date < GNR.next_add_dts
  ),
  PIVOT_EACH_POINT AS 
  ( SELECT DISTINCT
           Pivot_Grain,
           Driving_Date,
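           -- Snowflake's PIVOT names its output columns after the quoted
           -- literals in the IN list, hence the odd "'First_Col'" identifiers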
           MAX("'First_Col'") OVER ( PARTITION BY Pivot_Grain, Driving_Date) AS First_Col,
           MAX("'Second_Col'") OVER ( PARTITION BY Pivot_Grain, Driving_Date) AS Second_Col,
           MAX("'Third_Col'") OVER ( PARTITION BY Pivot_Grain, Driving_Date) AS Third_Col
      FROM NORMALIZED_EFFECTIVE_AT_EACH_POINT NEP
     PIVOT (MAX(Pivot_Attribute) FOR PIVOT_COLUMN IN ('First_Col','Second_Col','Third_Col'))
  )
  SELECT Pivot_Grain,
         Driving_Date AS From_Dt,
         COALESCE(LEAD(Driving_Date) OVER ( PARTITION BY pivot_grain ORDER BY Driving_Date),TO_DATE('9999-01-01','YYYY-MM-DD')) AS To_Dt,
         First_Col,
         Second_Col,
         Third_Col
    FROM PIVOT_EACH_POINT

Solution

  • The input can be written with the VALUES operator, with the column names moved into the CTE definition, so it takes up less space:

    WITH input(original_grain, pivot_grain, pivot_column, pivot_attribute, add_dts) AS ( 
        SELECT * FROM VALUES 
            ('OG-1', 'PG-1', 'First_Col', 'A', '2020-01-01'::date),
            ('OG-1', 'PG-1', 'First_Col', 'B', '2020-01-02'::date),
            ('OG-2', 'PG-1', 'Second_Col', 'A', '2020-01-01'::date),
            ('OG-3', 'PG-1', 'Third_Col', 'C', '2020-01-02'::date),
            ('OG-3', 'PG-1', 'Third_Col', 'B', '2020-01-03'::date) 
    )
      
    

    The LEAD can be simplified by using its default-value argument, which is an implicit COALESCE. Also, sometimes when you have gaps in data of this type, IGNORE NULLS is an awesome tool (a sketch follows this snippet):

    , get_normalized_ranges AS ( 
        SELECT 
            *
            ,LEAD(add_dts,1,'9000-01-01'::date) OVER (PARTITION BY original_grain ORDER BY add_dts) AS next_add_dts
        FROM input
    )
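
    As an aside on IGNORE NULLS, here is a minimal forward-fill sketch. It
    assumes a hypothetical sparse column some_attr that is NULL on most rows
    (not part of the original data); LAST_VALUE ... IGNORE NULLS carries the
    most recent non-NULL value forward within each grain:

    SELECT *
        -- fill each NULL with the latest earlier non-NULL value per grain
        ,LAST_VALUE(some_attr) IGNORE NULLS OVER (
            PARTITION BY original_grain
            ORDER BY add_dts
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS filled_attr
    FROM input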
      
    

    get_distinct_add_dates seems fine.

    , get_distinct_add_dates AS ( 
        SELECT DISTINCT add_dts AS driving_date
        FROM input  
    )
    

    Depending on your data, normalized_effective_at_each_point will live up to its name and give you a row at every date point, but it can also over-slice values that are unrelated to one another. I assume pivot_grain is some global id and that data under different pivot grains is distinct, so here is an expanded input to demonstrate:

            ('OG-1', 'PG-1', 'First_Col', 'A', '2020-01-01'::date),
            ('OG-1', 'PG-1', 'First_Col', 'B', '2020-01-03'::date),
            ('OG-2', 'PG-1', 'Second_Col','A', '2020-01-01'::date),
            ('OG-3', 'PG-1', 'Third_Col', 'C', '2020-01-03'::date),
            ('OG-3', 'PG-1', 'Third_Col', 'B', '2020-01-05'::date),
            ('OG-4', 'PG-2', 'First_Col', 'D', '2020-02-02'::date),
            ('OG-4', 'PG-2', 'First_Col', 'E', '2020-02-04'::date),
            ('OG-5', 'PG-2', 'Second_Col','D', '2020-02-02'::date),
            ('OG-6', 'PG-2', 'Third_Col', 'F', '2020-02-04'::date),
            ('OG-6', 'PG-2', 'Third_Col', 'D', '2020-02-06'::date) 
            
    

    at which point get_distinct_add_dates should become:

    , get_distinct_add_dates AS ( 
        SELECT DISTINCT pivot_grain, add_dts AS driving_date
        FROM input  
    )
            
    

    An INNER JOIN is just a JOIN, so we can skip the unneeded INNER. Note the join now also matches on pivot_grain, so dates from one grain no longer slice another:

    , normalized_effective_at_each_point AS ( 
        SELECT gnr.*,
            gdad.driving_date
        FROM get_normalized_ranges AS gnr
        JOIN get_distinct_add_dates AS gdad
            ON gnr.pivot_grain = gdad.pivot_grain 
            AND gdad.driving_date >= gnr.add_dts
            AND gdad.driving_date < gnr.next_add_dts
    )
      
    

    Really, pivot_each_point is a three-way JOIN, or a GROUP BY can be written instead (which is what the DISTINCT was effectively doing for us), and with that the PIVOT is gone. A sketch of the JOIN version follows this snippet:

      , pivot_each_point AS (
        SELECT Pivot_Grain
            ,Driving_Date
            ,MAX(IFF(pivot_column='First_Col', Pivot_Attribute, NULL)) as first_col
            ,MAX(IFF(pivot_column='Second_Col', Pivot_Attribute, NULL)) as second_col
            ,MAX(IFF(pivot_column='Third_Col', Pivot_Attribute, NULL)) as third_col
        FROM normalized_effective_at_each_point
        GROUP BY 1,2
      )
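
    For reference, here is a sketch of that three-way JOIN version (the GROUP BY
    above is what I would actually use). It assumes at most one attribute exists
    per (pivot_grain, driving_date, pivot_column), which this data satisfies:

    , pivot_each_point AS (
        SELECT gdad.pivot_grain
            ,gdad.driving_date
            ,f.pivot_attribute AS first_col
            ,s.pivot_attribute AS second_col
            ,t.pivot_attribute AS third_col
        FROM get_distinct_add_dates AS gdad
        -- one LEFT JOIN per pivot column; a missing combination yields NULL
        LEFT JOIN normalized_effective_at_each_point AS f
            ON f.pivot_grain = gdad.pivot_grain
            AND f.driving_date = gdad.driving_date
            AND f.pivot_column = 'First_Col'
        LEFT JOIN normalized_effective_at_each_point AS s
            ON s.pivot_grain = gdad.pivot_grain
            AND s.driving_date = gdad.driving_date
            AND s.pivot_column = 'Second_Col'
        LEFT JOIN normalized_effective_at_each_point AS t
            ON t.pivot_grain = gdad.pivot_grain
            AND t.driving_date = gdad.driving_date
            AND t.pivot_column = 'Third_Col'
    )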
      
    

    And lastly, the final LEAD can drop the COALESCE (using the default-value argument again) and be moved into pivot_each_point, giving the complete query:

    WITH input(original_grain, pivot_grain, pivot_column, pivot_attribute, add_dts) AS ( 
        SELECT * FROM VALUES 
            ('OG-1', 'PG-1', 'First_Col', 'A', '2020-01-01'::date),
            ('OG-1', 'PG-1', 'First_Col', 'B', '2020-01-03'::date),
            ('OG-2', 'PG-1', 'Second_Col','A', '2020-01-01'::date),
            ('OG-3', 'PG-1', 'Third_Col', 'C', '2020-01-03'::date),
            ('OG-3', 'PG-1', 'Third_Col', 'B', '2020-01-05'::date),
            ('OG-4', 'PG-2', 'First_Col', 'D', '2020-02-02'::date),
            ('OG-4', 'PG-2', 'First_Col', 'E', '2020-02-04'::date),
            ('OG-5', 'PG-2', 'Second_Col','D', '2020-02-02'::date),
            ('OG-6', 'PG-2', 'Third_Col', 'F', '2020-02-04'::date),
            ('OG-6', 'PG-2', 'Third_Col', 'D', '2020-02-06'::date) 
    ), get_normalized_ranges AS (      
        SELECT 
            *
            ,LEAD(add_dts,1,'9000-01-01'::date) OVER (PARTITION BY original_grain ORDER BY add_dts) AS next_add_dts
        FROM input
    ), get_distinct_add_dates AS ( 
        SELECT DISTINCT pivot_grain, add_dts AS driving_date
        FROM input  
    ), normalized_effective_at_each_point AS ( 
        SELECT gnr.*,
            gdad.driving_date
        FROM get_normalized_ranges AS gnr
        JOIN get_distinct_add_dates AS gdad
            ON gnr.pivot_grain = gdad.pivot_grain 
            AND gdad.driving_date >= gnr.add_dts
            AND gdad.driving_date < gnr.next_add_dts 
    )
    SELECT pivot_grain
        ,driving_date
        ,LEAD(driving_date, 1, '9999-01-01'::date) OVER (PARTITION BY pivot_grain ORDER BY driving_date) AS to_dt
        ,MAX(IFF(pivot_column = 'First_Col', pivot_attribute, NULL)) AS first_col
        ,MAX(IFF(pivot_column = 'Second_Col', pivot_attribute, NULL)) AS second_col
        ,MAX(IFF(pivot_column = 'Third_Col', pivot_attribute, NULL)) AS third_col
    FROM normalized_effective_at_each_point
    GROUP BY pivot_grain, driving_date
    ORDER BY pivot_grain, driving_date;
    

    giving the result:

    PIVOT_GRAIN  DRIVING_DATE   TO_DT       FIRST_COL   SECOND_COL  THIRD_COL
    PG-1         2020-01-01     2020-01-03  A           A           null
    PG-1         2020-01-03     2020-01-05  B           A           C
    PG-1         2020-01-05     9999-01-01  B           A           B
    PG-2         2020-02-02     2020-02-04  D           D           null
    PG-2         2020-02-04     2020-02-06  E           D           F
    PG-2         2020-02-06     9999-01-01  E           D           D
    

    I cannot help but think I have over-mapped how I process data onto your PIVOT_GRAIN. I started trying to solve this again from first principles now that I understand the code, and I think the first three processing CTEs are how I would do it, thus the GROUP BY is also how I would do the rest; the many-way JOIN alternative seems really gross. In Snowflake I am more of a fan of exploding the data and then MERGE-ing (or GROUP BY-ing) it, as that is all nicely parallelizable. A sketch of the MERGE half of that pattern follows.
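
    To make that last point concrete, here is a minimal MERGE sketch. It assumes
    the final query above has been saved as a view called pivot_intervals_v and
    that a persistent target table scd2_pivot exists; both names (and the
    target's columns) are hypothetical, not from the original post:

    -- hypothetical target (illustrative only):
    -- CREATE TABLE scd2_pivot (
    --     pivot_grain TEXT, from_dt DATE, to_dt DATE,
    --     first_col TEXT, second_col TEXT, third_col TEXT);

    MERGE INTO scd2_pivot AS tgt
    USING pivot_intervals_v AS src
        ON tgt.pivot_grain = src.pivot_grain
        AND tgt.from_dt = src.driving_date
    -- re-expire intervals whose end date has moved (e.g. 9999-01-01 rows
    -- that get closed off by newly arrived data)
    WHEN MATCHED AND tgt.to_dt <> src.to_dt THEN UPDATE SET
        to_dt = src.to_dt,
        first_col = src.first_col,
        second_col = src.second_col,
        third_col = src.third_col
    WHEN NOT MATCHED THEN INSERT
        (pivot_grain, from_dt, to_dt, first_col, second_col, third_col)
        VALUES (src.pivot_grain, src.driving_date, src.to_dt,
            src.first_col, src.second_col, src.third_col);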