I would like to calculate the rolling number of active users that were also active x or more times in the previous y days. For simplicity lets say >3 times in 5 days
The data I have is of the form
| date | user_id |
------------------
|2019-01-01 | user1 |
|2019-01-01 | user2 |
|2019-01-01 | user1 |
|2019-01-02 | user1 |
|2019-01-02 | user3 |
|2019-01-02 | user4 |
|2019-01-03 | user2 |
|2019-01-03 | user3 |
I have tried to use aggregate functions to group user_id by date and the having a window function sum the users that exist >3 times in the 5 preceding rows. Aggregate the users by date is fine using:
SELECT date, ARRAY_AGG(distinct user_id) as users
FROM `table`
WHERE date > DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY date
Returning an array structure like this.
| date | users |
------------------
|2019-01-01 | user1 |
| | user2 |
|2019-01-02 | user1 |
| | user3 |
| | user4 |
|2019-01-03 | user2 |
| | user3 |
But here I'm stuck. Can I do something like this? But what analytical function should I use?
WITH activity as (
SELECT date, ARRAY_AGG(distinct user_id) as users
FROM `table`
WHERE date > DATE_SUB(CURRENT_DATE(), INTERVAL 30 DAY)
GROUP BY date
)
SELECT date, xxxx OVER (PARTITION BY date ORDER BY date ROWS 5 PRECEDING) as returning_users
FROM activity
The output that I would like is:
| date | returning_users |
------------------
|2019-01-01 | 123 |
|2019-01-02 | 1234 |
|2019-01-03 | 12345 |
You may well be right that there are some suitable window functions to achieve this, but I am not familiar with them if there are.
However since there is no existing solution, here is a working method just using joins:
WITH ACTIVE_DAYS AS (
SELECT
a.date,
a.user_id,
COUNT(DISTINCT b.date) AS ActiveDays
FROM
DATA a
JOIN
DATA b
ON
a.user_id = b.user_id
AND b.date BETWEEN DATE_SUB(a.date, INTERVAL 5 DAY) AND DATE_SUB(a.date, INTERVAL 1 DAY)
GROUP BY
1,
2)
SELECT
a.date,
COUNT(DISTINCT a.user_id) AS DAU,
COUNT(DISTINCT (CASE WHEN b.ActiveDays >= 3 THEN a.user_id END)) AS DAU_Meet_Criteria
FROM
DATA a
LEFT JOIN
ACTIVE_DAYS b
ON
a.date = b.date
AND a.user_id = b.user_id
GROUP BY
1
ORDER BY
1
By editing the date_sub
functions and the case statement in the second count(distinct...
statement you could achieve some alternative logic.
Hope this helps.