Search code examples
sqlteradatateradata-sql-assistant

SQL count occurrences in window


I have user logins by date. My requirement is to track the number of users that have been logged in during the past 90 days window.

I am new to both SQL in general and Teradata specifically and I can't get the window functionality to work as I need.

I need the following result, where ACTIVE is a count of the unique USER_IDs that appear in the previous 90 day window the DATE.

DATES        ACTIVE_IN_WINDOW
12/06/2018     20
13/06/2018     45                 
14/06/2018     65 
15/06/2018     73 
17/06/2018     24      
18/06/2018     87  
19/06/2018     34
20/06/2018     51

Currently my script is as follows.

It is this line here that I cant get right

COUNT ( USER_ID) OVER (PARTITION BY USER_ID ORDER BY EVT_DT ROWS BETWEEN 90 PRECEDING AND  0 FOLLOWING)

I suspect I need a different set of functions to make this work.

SELECT    b.DATES , a.ACTIVE_IN_WINDOW

FROM    

(
        SELECT 

        CAST(CALENDAR_DATE AS DATE) AS DATES FROM SYS_CALENDAR.CALENDAR

        WHERE DATES BETWEEN ADD_MONTHS(CURRENT_DATE, - 10)  AND CURRENT_DATE
) b

LEFT JOIN

(
        SELECT    USER_ID   , EVT_DT 

        , COUNT ( USER_ID) OVER (PARTITION BY USER_ID ORDER BY EVT_DT ROWS BETWEEN 90 PRECEDING AND  0 FOLLOWING) AS ACTIVE_IN_WINDOW

        FROM ENV0.R_ONBOARDING
) a

ON a.EVT_DT = b.DATES

ORDER BY b.DATES

Thank you for any assistance.


Solution

  • The logic is similar to Gordon', but a non-equi-Join instead of a Correlated Scalar Subquery is usually more efficient on Teradata:

    SELECT b.DATES , Count(DISTINCT USER_ID)
    FROM
     (
       SELECT CALENDAR_DATE AS DATES 
       FROM SYS_CALENDAR.CALENDAR
       WHERE DATES BETWEEN Add_Months(Current_Date, - 10)  AND Current_Date
     ) b
    LEFT JOIN
     ( -- apply DISTINCT before aggregation to reduce intermediate spool
       SELECT DISTINCT USER_ID, EVT_DT
       FROM ENV0.R_ONBOARDING
     ) AS a
    ON a.EVT_DT BETWEEN Add_Months(b.DATES,-3) AND b.DATES
    GROUP BY 1
    ORDER BY 1
    

    Of course this will require a large spool and much CPU.

    Edit:

    Switching to weeks reduces the overhead, I'm using dates instead of week numbers (it's easier to modify for other ranges):

    SELECT b.Week , Count(DISTINCT USER_ID) 
    FROM
     ( -- Return only Mondays instead of DISTINCT over all days 
       SELECT calendar_date AS Week
       FROM SYS_CALENDAR.CALENDAR 
       WHERE CALENDAR_DATE BETWEEN Add_Months(Current_Date, -9) AND Current_Date
         AND day_of_week = 2 -- 2 = Monday
     ) b 
    LEFT JOIN 
     (
       SELECT DISTINCT USER_ID,
         -- td_monday returns the previous Monday, but we need the following monday
         -- covers the previous Tuesday up to the current Monday
                Td_Monday(EVT_DT+6) AS PERIOD_WEEK
       FROM ENV0.R_ONBOARDING
       -- You should add another condition to limit the actually covered date range, e.g.
       -- where EVT_DT BETWEEN Add_Months(b.DATES,-13) AND b.DATES
     ) AS a 
    ON a.PERIOD_WEEK BETWEEN b.Week-(12*7) AND b.Week 
    GROUP BY 1 
    ORDER BY 1 
    

    Explain should duplicate the calendar as preparation for the product join, if not you might need to materialize the dates in a Volatile Table. Better don't use sys_calendar, there are no statistics, e.g. optimizer doesn't know about how many days per week/month/year, etc. Check your system, there should be a calendar table designed for you company needs (with stats on all columns)