Search code examples
sqloracle-databaseoracle11gwindow-functionsanalytic-functions

oracle SQL select the distinct customers in the past x day rolling period


say you have a table of customers with dates as follows:
[customer_table]

+----------+-----------+----------+
| customer | date      | purchase |
+----------+-----------+----------+
| 1        | 1/01/2016 | 12       |
+----------+-----------+----------+
| 1        | 1/12/2016 | 3        |
+----------+-----------+----------+
| 2        | 5/03/2016 | 5        |
+----------+-----------+----------+
| 3        | 1/16/2016 | 6        |
+----------+-----------+----------+
| 3        | 3/22/2016 | 1        |
+----------+-----------+----------+  

I want to write a query to count how many distinct customers made a purchase in the last 10 days as a rolling period, starting from each calendar day and counting 10 days backwards. So for each unique day in 2016 the final output would be a calendar, where each day has a count of distinct customers that exist in the prior 10 days of that day of the calendar like the following:
[result_table]

+-----------+------------------+
| date      | unique customers |
+-----------+------------------+
| 1/01/2016 | 112              |
+-----------+------------------+
| 1/02/2016 | 104              |
+-----------+------------------+
| 1/03/2016 | 140              |
+-----------+------------------+
| 1/04/2016 | 133              |
+-----------+------------------+
| ....      | 121              |
+-----------+------------------+  

One solution I came up with was to create a calendar table that is one single column, then join the calendar table to the customer table with an inequality join. I believe this is extremely inefficient and am seeking a faster solution. So my first step was create a calendar like so:
[calendar]

+-----------+
| date      |
+-----------+
| 1/01/2016 |
+-----------+
| 1/02/2016 |
+-----------+
| 1/03/2016 |
+-----------+
| 1/04/2016 |
+-----------+
| 1/05/2016 |
+-----------+  

Then for each day in that calendar, to count the distinct set of customers prior to each day, I join with an inequality like so:

select
count(distinct customer) as unique customers
from calendar c
left join mytable m
on c.date>=m.date and m.date>=c.date-10  

While I believe this is correct, it runs extremely slow (say for 2 years of a calendar with a few million customers). Is there an oracle analytic function that may help me out here?


Solution

  • Is there an oracle analytic function that may help me out here?

    Not really - from the COUNT() documentation:

    If you specify DISTINCT, then you can specify only the query_partition_clause of the analytic_clause. The order_by_clause and windowing_clause are not allowed.

    You would want both DISTINCT and a windowing_clause which isn't allowed.

    Update:

    You can get the same effect as the invalid syntax using a combination of a non-DISTINCT analytic query partitioned by customer and then aggregation by day:

    Oracle Setup:

    CREATE TABLE table_name ( customer, dt ) AS
      SELECT 1, DATE '2017-01-10' FROM DUAL UNION ALL
      SELECT 1, DATE '2017-01-11' FROM DUAL UNION ALL
      SELECT 1, DATE '2017-01-15' FROM DUAL UNION ALL
      SELECT 1, DATE '2017-01-20' FROM DUAL UNION ALL
      SELECT 2, DATE '2017-01-12' FROM DUAL UNION ALL
      SELECT 2, DATE '2017-01-19' FROM DUAL UNION ALL
      SELECT 3, DATE '2017-01-10' FROM DUAL UNION ALL
      SELECT 3, DATE '2017-01-13' FROM DUAL UNION ALL
      SELECT 3, DATE '2017-01-15' FROM DUAL UNION ALL
      SELECT 3, DATE '2017-01-20' FROM DUAL;
    

    Query:

    Note: the query below is only for one month's data and for a range of two days preceding to illustrate the principle but it is easy enough to change the parameters to 12 months and 10 days.

    SELECT day,
           SUM( has_order_in_range ) AS unique_customers
    FROM   (
      SELECT customer,
             day,
             LEAST(
               1,
               COUNT(dt) OVER ( PARTITION BY customer
                                ORDER BY day
                                RANGE BETWEEN INTERVAL '2' DAY PRECEDING
                                          AND INTERVAL '0' DAY FOLLOWING )
             ) AS has_order_in_range
      FROM   table_name t
             PARTITION BY ( customer )
             RIGHT OUTER JOIN
             ( -- Create a calendar for one month
               SELECT DATE '2017-01-01' + LEVEL - 1 AS day
               FROM   DUAL
               CONNECT BY DATE '2017-01-01' + LEVEL - 1 < ADD_MONTHS( DATE '2017-01-01', 1 )
             ) d
             ON ( t.dt = d.day )
    )
    GROUP BY day
    ORDER BY day;
    

    Output:

    DAY                 UNIQUE_CUSTOMERS
    ------------------- ----------------
    2017-01-01 00:00:00                0
    2017-01-02 00:00:00                0
    2017-01-03 00:00:00                0
    2017-01-04 00:00:00                0
    2017-01-05 00:00:00                0
    2017-01-06 00:00:00                0
    2017-01-07 00:00:00                0
    2017-01-08 00:00:00                0
    2017-01-09 00:00:00                0
    2017-01-10 00:00:00                2
    2017-01-11 00:00:00                2
    2017-01-12 00:00:00                3
    2017-01-13 00:00:00                3
    2017-01-14 00:00:00                2
    2017-01-15 00:00:00                2
    2017-01-16 00:00:00                2
    2017-01-17 00:00:00                2
    2017-01-18 00:00:00                0
    2017-01-19 00:00:00                1
    2017-01-20 00:00:00                3
    2017-01-21 00:00:00                3
    2017-01-22 00:00:00                2
    2017-01-23 00:00:00                0
    2017-01-24 00:00:00                0
    2017-01-25 00:00:00                0
    2017-01-26 00:00:00                0
    2017-01-27 00:00:00                0
    2017-01-28 00:00:00                0
    2017-01-29 00:00:00                0
    2017-01-30 00:00:00                0
    2017-01-31 00:00:00                0