I am trying to find the distinct count of users from the US that have performed a specific action (any row in p.action) within a 14 day rolling window, over the past ~2 months.
Here's the query. I'd love some tips on how I can re-write this to not use a correlated subquery, because Presto doesn't allow them.
SELECT dt,
(SELECT COUNT(DISTINCT user_id)
FROM p.action
WHERE dt BETWEEN q.dt - 13 AND q.dt -- period of 14 days
AND country = 'US'
) AS 14d_rolling_users
FROM p.action q
WHERE dt BETWEEN '2016-08-24' AND '2016-10-24'
GROUP BY dt
ORDER BY dt ASC
I've been racking my head trying to figure out how I could accomplish this without just running 60 individual queries (one for each day).
Any help appreciated, thanks!
Best do a manual rollup.
This turns each row in your table into 14 rows with additional rollup__ds
timestamps. We then group by this new column to create a rolling 14 day window. The complexity of this is O(N*14) = O(N)
and thus linear.
SELECT
rollup__ds,
COUNT(DISTINCT username)
FROM (
SELECT
username,
ds
FROM
actions
WHERE
ds BETWEEN '2016-08-24' AND '2016-10-24'
AND country = 'US'
)
CROSS JOIN
UNNEST(ARRAY[
DATE_ADD('day', 0, CAST(ds AS DATE)),
DATE_ADD('day', 1, CAST(ds AS DATE)),
...
DATE_ADD('day', 12, CAST(ds AS DATE)),
DATE_ADD('day', 13, CAST(ds AS DATE))
]) AS t (rollup__ds)
GROUP BY
rollup__ds
ORDER BY
rollup__ds
;
Hope that helps!
NB—if you do NOT need distinct count best use a window function, alas this does NOT work for distinct count because they are not summable like that.
SELECT
ds,
-- BEWARE this count is NOT distinct!
SUM(COUNT(username)) over (ORDER BY ds ROWS BEWTEEN 13 PRECEDING AND CURRENT ROW)
FROM
actions
WHERE
ds BETWEEN '2016-08-24' AND '2016-10-24'
AND country = 'US'
;