Search code examples
sqlhivecorrelated-subquerypresto

How to remove correlated subquery in my subselect (due to presto limitation)


I am trying to find the distinct count of users from the US that have performed a specific action (any row in p.action) within a 14 day rolling window, over the past ~2 months.

Here's the query. I'd love some tips on how I can re-write this to not use a correlated subquery, because Presto doesn't allow them.

SELECT dt,
    (SELECT COUNT(DISTINCT user_id)
     FROM p.action
     WHERE dt BETWEEN q.dt - 13 AND q.dt -- period of 14 days
        AND country = 'US'
     ) AS 14d_rolling_users
FROM p.action q
WHERE dt BETWEEN '2016-08-24' AND '2016-10-24'  
GROUP BY dt
ORDER BY dt ASC

I've been racking my head trying to figure out how I could accomplish this without just running 60 individual queries (one for each day).

Any help appreciated, thanks!


Solution

  • Best do a manual rollup.

    This turns each row in your table into 14 rows with additional rollup__ds timestamps. We then group by this new column to create a rolling 14 day window. The complexity of this is O(N*14) = O(N) and thus linear.

    SELECT 
      rollup__ds,
      COUNT(DISTINCT username)
    FROM (
      SELECT
        username,
        ds
      FROM
        actions
      WHERE 
        ds BETWEEN '2016-08-24' AND '2016-10-24'
        AND country = 'US'
    )
    CROSS JOIN 
      UNNEST(ARRAY[
        DATE_ADD('day', 0, CAST(ds AS DATE)),
        DATE_ADD('day', 1, CAST(ds AS DATE)),
        ...
        DATE_ADD('day', 12, CAST(ds AS DATE)),
        DATE_ADD('day', 13, CAST(ds AS DATE))
      ]) AS t (rollup__ds)
    GROUP BY 
      rollup__ds
    ORDER BY 
      rollup__ds
    ;
    

    Hope that helps!


    NB—if you do NOT need distinct count best use a window function, alas this does NOT work for distinct count because they are not summable like that.

    SELECT 
      ds, 
      -- BEWARE this count is NOT distinct!
      SUM(COUNT(username)) over (ORDER BY ds ROWS BEWTEEN 13 PRECEDING AND CURRENT ROW) 
    FROM
      actions
    WHERE
      ds BETWEEN '2016-08-24' AND '2016-10-24'
      AND country = 'US'
    ;