Tags: postgresql, postgresql-9.3, window-functions

Count distinct users over n-days


My table consists of two fields: CalDay, a timestamp field with the time always set to 00:00:00, and UserID. Together they form a compound key, but it is important to keep in mind that there are many rows for each calendar day, and the number of rows per day is not fixed.
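
For reference, the table looks roughly like this (the UserID type is my assumption for illustration; it could just as well be text or uuid):

CREATE TABLE data_table (
    CalDay  timestamp NOT NULL,  -- always at 00:00:00
    UserID  integer   NOT NULL,  -- type assumed for illustration
    PRIMARY KEY (CalDay, UserID)
);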

Based on this dataset, I need to calculate how many distinct users there are over a set window of time, say 30 days.

Using Postgres 9.3 I cannot use COUNT(DISTINCT UserID) OVER ..., nor can I work around the issue with DENSE_RANK() OVER (... RANGE BETWEEN ...), because RANGE only accepts UNBOUNDED offsets.
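
This is roughly what I would like to write, but it fails on both counts (DISTINCT is not allowed inside a window aggregate, and 9.3 only accepts UNBOUNDED with RANGE):

SELECT
  CalDay
 ,COUNT(DISTINCT UserID) OVER (
      ORDER BY CalDay
      RANGE BETWEEN interval '30 days' PRECEDING AND CURRENT ROW
  ) AS rolling_count
FROM data_table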

So I went the old-fashioned way and tried a scalar subquery:

SELECT
  xx.*
 ,(
       SELECT COUNT(DISTINCT UserID) 
       FROM data_table AS yy
       WHERE yy.CalDay BETWEEN xx.CalDay - interval '30 days' AND xx.CalDay
  ) as rolling_count
FROM data_table AS xx
ORDER BY xx.CalDay

In theory, this should work, right? I am not sure yet, because I started the query about 20 minutes ago and it is still running. Herein lies the problem: the dataset is still relatively small (25,000 rows) but will grow over time, so I need something that scales and performs better.

I was thinking that maybe - just maybe - using the Unix epoch instead of the timestamp could help, but that is only a wild guess. Any suggestion would be welcome.


Solution

  • This should work. I can't comment on speed, but it should take a lot less time than your current query. Hopefully you have indexes on both of these fields.

    SELECT t1.calday, COUNT(DISTINCT t1.userid) AS daily, COUNT(DISTINCT t2.userid) AS last_30_days
    FROM data_table t1
    JOIN data_table t2
        ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
    GROUP BY t1.calday
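
    If they are not already in place, the indexes I have in mind look roughly like this (index names are just examples):

    CREATE INDEX data_table_calday_idx ON data_table (calday);  -- example name
    CREATE INDEX data_table_userid_idx ON data_table (userid);  -- example name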
    

    UPDATE

    Tested it with a lot of data. The above works but is slow. Much faster to do it like this:

    SELECT t1.*, COUNT(DISTINCT t2.userid) AS last_30_days
    FROM (
        SELECT calday, COUNT(DISTINCT userid) AS daily
        FROM data_table
        GROUP BY calday
    ) t1
    JOIN data_table t2
        ON t2.calday BETWEEN t1.calday - '30 days'::INTERVAL AND t1.calday
    GROUP BY 1, 2
    

    So instead of building up a massive table of all the JOIN combinations and then grouping/aggregating, it first gets the "daily" data, then joins the 30-day window onto that. This keeps the join much smaller and the query returns quickly (just under 1 second for 45,000 rows in the source table on my system).
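
    If you want to reproduce the comparison, something along these lines generates throwaway data to run both queries against (the table name, row count, user pool size and date spread are all arbitrary, and I skip the compound key so duplicate (calday, userid) pairs don't matter):

    -- disposable test table: ~45,000 rows spread over ~180 days, ~500 users
    CREATE TABLE data_table_test AS
    SELECT (current_date - (random() * 180)::int)::timestamp AS calday,
           (random() * 500)::int AS userid
    FROM generate_series(1, 45000);

    Point both versions above at data_table_test instead of data_table to compare their run times.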