Search code examples
sqlverticadailybuilds

SQL - Vertica: How to generate daily rows with most previous date data


I have a base table like below:

score_upd (Upd_dt,Url,Score) AS (
          SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
)

   Upd_dt       URL    Score
 2019-07-26      A       x
 2019-07-26      B      alpha 
 2019-08-01      A       y
 2019-08-01      B      beta
 2019-08-03      A       z
 2019-08-03      B      gamma

And I want to create a table in daily-url level, using most previous date's value for the new rows, result should look like below:

score_upd (Upd_dt,Url,Score) AS (
          SELECT DATE '2019-07-26','A','x'
UNION ALL SELECT DATE '2019-07-26','B','alpha'
UNION ALL SELECT DATE '2019-07-27','A','x'
UNION ALL SELECT DATE '2019-07-27','B','alpha'
UNION ALL SELECT DATE '2019-07-28','A','x'
UNION ALL SELECT DATE '2019-07-28','B','alpha'
UNION ALL SELECT DATE '2019-07-29','A','x'
UNION ALL SELECT DATE '2019-07-29','B','alpha'
UNION ALL SELECT DATE '2019-07-30','A','x'
UNION ALL SELECT DATE '2019-07-30','B','alpha'
UNION ALL SELECT DATE '2019-07-31','A','x'
UNION ALL SELECT DATE '2019-07-31','B','alpha'
UNION ALL SELECT DATE '2019-08-01','A','y'
UNION ALL SELECT DATE '2019-08-01','B','beta'
UNION ALL SELECT DATE '2019-08-02','A','y'
UNION ALL SELECT DATE '2019-08-02','B','beta'
UNION ALL SELECT DATE '2019-08-03','A','z'
UNION ALL SELECT DATE '2019-08-03','B','gamma'
UNION ALL SELECT DATE '2019-08-04','A','z'
UNION ALL SELECT DATE '2019-08-04','B','gamma'
UNION ALL SELECT DATE '2019-08-05','A','z'
UNION ALL SELECT DATE '2019-08-05','B','gamma'
) 

Which looks like:

   Upd_dt       URL    Score 
 2019-07-26      A       x
 2019-07-26      B      alpha 
 2019-07-27      A       x
 2019-07-27      B      alpha 
 2019-07-28      A       x
 2019-07-28      B      alpha 
 2019-07-29      A       x
 2019-07-29      B      alpha 
 2019-07-30      A       x
 2019-07-30      B      alpha 
 2019-07-31      A       x
 2019-07-31      B      alpha 
 2019-08-01      A       y
 2019-08-01      B      beta
 2019-08-02      A       y
 2019-08-02      B      beta
 2019-08-03      A       z
 2019-08-03      B      gamma
 2019-08-04      A       z
 2019-08-04      B      gamma
 2019-08-05      A       z
 2019-08-05      B      gamma
.
.
.

Current process is: I built a daily dimension table since 7/26/2019 till today by:

/* SELECT CAST(slice_time AS DATE) dates FROM testcalendar mtc TIMESERIES slice_time as '1 day' OVER (ORDER BY CAST(mtc.dates as TIMESTAMP)); */

so I get:

Dates

2019-07-26

2019-07-27

2019-07-28

2019-07-29

.

.

.

2019-10-12 (today)

I'm thinking if I can use function such as "interpolate previous value" to join my first table by dates, to generate missing days by using values from most previous date data, while it failed.

The result didn't generate rows for missing days.

Please let me know if anyone has any better idea on this.

Thanks!


Solution

  • As a starting warning : only store a "daily photograph" when it really, really is necessary. In my past, I once ended up having 364 rows too many per year, as the values only changed once a year. In Vertica, that costs license, and CPU and clock time for joining and grouping ...

    But, for the rest - Good start.

    But you could apply the TIMESERIES without having to build a calendar.

    The trick is to "extrapolate" manually what you can INTERPOLATE automatically.

    Add an in-line 'padding' table, which contains the newest value per URL, but give it CURRENT_DATE instead of the newest actual date - using Vertica's peculiar analytic limit clause LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC) .

    UNION SELECT that padding table with your input, and apply the TIMESERIES clause to that UNION SELECT.

    Like so:

    WITH
    -- your input ...
    score_upd (Upd_dt,Url,Score) AS (
              SELECT DATE '2019-07-26','A','x'
    UNION ALL SELECT DATE '2019-07-26','B','alpha'
    UNION ALL SELECT DATE '2019-08-01','A','y'
    UNION ALL SELECT DATE '2019-08-01','B','beta'
    UNION ALL SELECT DATE '2019-08-03','A','z'
    UNION ALL SELECT DATE '2019-08-03','B','gamma'
    )
    -- real WITH clause would start here ...                                                                                                                                                 
    ,
    -- newest row per Url, just with current date
    pad_newest AS (
    SELECT
      CURRENT_DATE
    , url 
    , score
    FROM score_upd
    LIMIT 1 OVER(PARTITION BY url ORDER BY upd_dt DESC)
    )   
    ,   
    with_newest AS (
    SELECT
      *   
    FROM score_upd
    UNION ALL 
    SELECT *
    FROM pad_newest
    )   
    SELECT
      ts_dt::DATE           AS upd_dt
    , url                   AS url 
    , TS_FIRST_VALUE(score) AS score
    FROM with_newest
    TIMESERIES ts_dt AS '1 day' OVER (
      PARTITION BY url ORDER BY upd_dt::TIMESTAMP
    )   
    ORDER BY 1,2 
    ;