I'm trying to look at the number of active users of a product (toy example) over the last 30 days.
I'm considering two approaches.
One, date_sub
is used to find the date 29 days before (the interval is 30 days inclusive of the start date) an end date. The where
window is then defined by that earlier date and the end date.
That is this example:
SELECT
activity_date AS day,
COUNT(DISTINCT user_id) AS active_users
FROM Activity
WHERE
activity_date >= DATE_SUB("2019-07-27", INTERVAL 29 DAY)
AND
activity_date >= "2019-07-27"
A second approach is to calculate the datediff
from a start date, then restrict the where clause to the previous time period.
SELECT
activity_date as day,
COUNT(DISTINCT user_id) AS active_users
FROM Activity
WHERE
datediff('2019-07-27', activity_date) < 30
AND
activity_date <= '2019-07-27'
I have no insight into which is the better option. I'd love for others to weigh in.
Use the first option:
activity_date
BETWEEN DATE_SUB(DATE("2019-07-27"), INTERVAL 29 DAY)
AND DATE("2019-07-27")
This compares the stored value directly to date litterals. Such an expression can take advantage of an index on the date column.
In, constrast the second expression applies date function datediff()
to the date column. This makes the expression non-SARGable, meaning that it will not benefit an index:
datediff('2019-07-27', activity_date) < 30
and activity_date <= '2019-07-27'
Note that the first expression could be simply phrased:
activity_date >= '2019-07-27' - interval 29 day
and activity_date <= '2019-07-27'
I am unsure whether the second comparison should be >=
rather than >
. A reason why it would make sense is that activitydate
has no time component. But I would recomment using <
, because it works for both cases; if you want data up until '2019-07-27'
included, you can do:
activity_date >= '2019-07-27' - interval 29 day
and activity_date < '2019-07-28'