Search code examples
sqlpostgresqltime-seriesaggregategreatest-n-per-group

Get apps with the highest review count since a dynamic series of days


I have two tables, apps and reviews (simplified for the sake of discussion):

apps table

id int

reviews table

id          int
review_date date
app_id      int (foreign key that points to apps)

2 questions:

1. How can I write a query / function to answer the following question?:

Given a series of dates from the earliest reviews.review_date to the latest reviews.review_date (incrementing by a day), for each date, D, which apps had the most reviews if the app's earliest review was on or later than D?

I think I know how to write a query if given an explicit date:

SELECT
  apps.id,
  count(reviews.*)
FROM
  reviews
  INNER JOIN apps ON apps.id = reviews.app_id
group by
  1
having
  min(reviews.review_date) >= '2020-01-01'
  order by 2 desc
limit 10;

But I don't know how to query this dynamically given the desired date series and compile all this information in a single view.

2. What's the best way to model this data?

It would be nice to have the # of reviews at the time for each date as well as the app_id. As of now I'm thinking something that might look like:

... 2020-01-01_app_id | 2020-01-01_review_count | 2020-01-02_app_id | 2020-01-02_review_count ...

But I'm wondering if there's a better way to do this. Stitching the data together also seems like a challenge.


Solution

  • I think this is what you are looking for:

    Postgres 13 or newer

    WITH cte AS (  -- MATERIALIZED
       SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
       FROM   reviews
       GROUP  BY 1
       )
    SELECT *
    FROM  (
       SELECT generate_series(min(review_date)
                            , max(review_date)
                            , '1 day')::date
       FROM   reviews
       ) d(review_window_start)
    LEFT  JOIN LATERAL (
       SELECT total_ct, array_agg(app_id) AS apps
       FROM  (
          SELECT app_id, total_ct
          FROM   cte c
          WHERE  c.earliest_review >= d.review_window_start
          ORDER  BY total_ct DESC
          FETCH  FIRST 1 ROWS WITH TIES  -- new & hot
          ) sub
       GROUP  BY 1
       ) a ON true;
    

    WITH TIES makes it a bit cheaper. Added in Postgres 13 (currently beta). See:

    Postgres 12 or older

    WITH cte AS (  -- MATERIALIZED
       SELECT app_id, min(review_date) AS earliest_review, count(*)::int AS total_ct
       FROM   reviews
       GROUP  BY 1
       )
    SELECT *
    FROM  (
       SELECT generate_series(min(review_date)
                            , max(review_date)
                            , '1 day')::date
       FROM   reviews
       ) d(review_window_start)
    LEFT  JOIN LATERAL (
       SELECT total_ct, array_agg(app_id) AS apps
       FROM  (
          SELECT total_ct, app_id
              ,  rank() OVER (ORDER BY total_ct DESC) AS rnk
          FROM   cte c
          WHERE  c.earliest_review >= d.review_window_start
          ) sub
       WHERE  rnk = 1
       GROUP  BY 1
       ) a ON true;
    

    db<>fiddle here

    Same as above, but without WITH TIES.

    We don't need to involve the table apps at all. The table reviews has all information we need.

    The CTE cte computes earliest review & current total count per app. The CTE avoids repeated computation. Should help quite a bit.
    It is always materialized before Postgres 12, and should be materialized automatically in Postgres 12 since it is used many times in the main query. Else you could add the keyword MATERIALIZED in Postgres 12 or later to force it. See:

    The optimized generate_series() call produces the series of days from earliest to latest review. See:

    Finally, the LEFT JOIN LATERAL you already discovered. But since multiple apps can tie for the most reviews, retrieve all winners, which can be 0 - n apps. The query aggregates all daily winners into an array, so we get a single result row per review_window_start. Alternatively, define tiebreaker(s) to get at most one winner. See: