Search code examples
mysqlsqlquery-optimizationwhere-clausedate-arithmetic

SQL date_sub vs datediff performance in looking over a date window


I'm trying to look at the number of active users of a product (toy example) over the last 30 days.

I'm considering two approaches.

One, date_sub is used to find the date 29 days before (the interval is 30 days inclusive of the start date) an end date. The where window is then defined by that earlier date and the end date.

That is this example:

SELECT 
    activity_date AS day, 
    COUNT(DISTINCT user_id) AS active_users
FROM Activity
WHERE 
    activity_date >= DATE_SUB("2019-07-27", INTERVAL 29 DAY) 
      AND 
    activity_date >= "2019-07-27" 

A second approach is to calculate the datediff from a start date, then restrict the where clause to the previous time period.

SELECT
    activity_date as day,
    COUNT(DISTINCT user_id) AS active_users
FROM Activity
WHERE 
    datediff('2019-07-27', activity_date) < 30
      AND
    activity_date <= '2019-07-27'

I have no insight into which is the better option. I'd love for others to weigh in.


Solution

  • Use the first option:

    activity_date 
        BETWEEN DATE_SUB(DATE("2019-07-27"), INTERVAL 29 DAY) 
        AND DATE("2019-07-27") 
    

    This compares the stored value directly to date litterals. Such an expression can take advantage of an index on the date column.

    In, constrast the second expression applies date function datediff() to the date column. This makes the expression non-SARGable, meaning that it will not benefit an index:

    datediff('2019-07-27', activity_date) < 30
    and activity_date <= '2019-07-27'
    

    Note that the first expression could be simply phrased:

    activity_date >= '2019-07-27' - interval 29 day
    and activity_date <= '2019-07-27'
    

    I am unsure whether the second comparison should be >= rather than >. A reason why it would make sense is that activitydate has no time component. But I would recomment using <, because it works for both cases; if you want data up until '2019-07-27' included, you can do:

    activity_date >= '2019-07-27' - interval 29 day
    and activity_date < '2019-07-28'