Search code examples
sqlwindow-functionsamazon-athena

How to ensure every date and deployment is present and window function doesn't skip days because data not present?


Working in Athena, I have a query that generates a table with several measures for each iteration of date and deployment.

WITH
  core AS (
   SELECT /* create dataset for window functions to work on */
     "deployment"
   , "macaddress"
   , "date"
   FROM 
     (
      SELECT /* de-dupe the base dataset based on a macaddress being valid once per day per deployment */
        "_deployment" "deployment"
      , "macaddress"
      , "date"("timestamp") "date"
      , "row_number"() OVER (PARTITION BY "_deployment", "macaddress" ORDER BY "date"("timestamp") ASC) "rn"
      FROM
        registration
   ) 
   WHERE (CAST("rn" AS varchar(2)) = '1')
) 
SELECT /* window functions to aggregate */
  "count"("macaddress") "registrations"
, "deployment"
, "date"
, "lag"("count"("macaddress"), 1) OVER (PARTITION BY "deployment" ORDER BY "date" ASC) "yesterdayCount"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) "Past7Days"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING) "7daysTarget"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) "Past30Days"
, "sum"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING) "30daysTarget"
, "avg"("count"("macaddress")) OVER (PARTITION BY "deployment" ORDER BY "date" ASC ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) "10DayTrend"
FROM
  core
GROUP BY "deployment", "date"

The problem I am having is related to the fact that not every deployment has data for every date, thus the window functions are not working in the intended way, they are summing the previous 7 rows, not the previous 7 days.

The output that I am after is a table with a row for every date between the earliest and latest in the dataset (the latest date will keep moving, the table is updated each night) and each deployment present in the dataset, whether there is data in the dataset or not.

So, if deployment 1 has data for the following days:

2020-04-1
2020-04-2
2020-04-3
2020-04-5
2020-04-6
2020-04-7
2020-04-8

I would want the Past7Days field for 2020-04-8 to be the sum of 2020-04-2 to 2020-04-08. At the moment the result is the sum of 2020-04-01 to 2020-04-08 as it is looking back rows, not dates and 2020-04-4 is missing.

The final table needs to be in the format:

registrations | deployment | date | yesterdayCount | Past7Days | 7daysTarget | Past30Days | 30daysTarget | 10DayTrend

and have no gaps for date or deployment.

My working theory is that I need to establish this on the initial innermost SELECT statement (The deduping one). I have a calendar table and have been joining to it, but I am being scuppered by the fact that there isn't a deployment field in the calendar, and so the join is incomplete consisting only of date. If at all possible I want to avoid having a calendar table per deployment.

Thank you in advance for any help.


Solution

  • just create your deployment calendar by cross joining the deployments with the calendartable. Something like

    WITH
      core AS (
       SELECT /* create dataset for window functions to work on */
         "deployment"
       , "macaddress"
       , "date"
       FROM 
         (
          SELECT /* de-dupe the base dataset based on a macaddress being valid once per day per deployment */
            "_deployment" "deployment"
          , "macaddress"
          , "date"("timestamp") "date"
          , "row_number"() OVER (PARTITION BY "_deployment", "macaddress" ORDER BY "date"("timestamp") ASC) "rn"
          FROM
            registration
       ) 
       WHERE (CAST("rn" AS varchar(2)) = '1')
    ) ,
    deploymentcalendar as (
    select t.deployment,ct.date 
    from (select distinct deploymet from core) t
    cross join calendar_table ct
    )
    SELECT /* window functions to aggregate */
      "count"("macaddress") "registrations"
    , "deployment"
    , "date"
    , "lag"("count"("macaddress"), 1) OVER (PARTITION BY dc."deployment" ORDER BY "date" ASC) "yesterdayCount"
    , "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) "Past7Days"
    , "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 13 PRECEDING AND 7 PRECEDING) "7daysTarget"
    , "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) "Past30Days"
    , "sum"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 59 PRECEDING AND 30 PRECEDING) "30daysTarget"
    , "avg"("count"("macaddress")) OVER (PARTITION BY dc."deployment" ORDER BY dc."date" ASC ROWS BETWEEN 9 PRECEDING AND CURRENT ROW) "10DayTrend"
    FROM
      core
    RIGHT JOIN deploymentcalendar dc on dc.deployment=core.deployment and dc.date=core.date
    GROUP BY core."deployment", core."date"