Search code examples
sqlruby-on-railspostgresqlgreatest-n-per-group

PostgreSQL Select highest value within time period


Using Rails, I'm trying to perform a SQL command to return an array of rows that contain the highest value for a certain user per day.

For example:

  user_id(integer) |        created_at(datetime)    | score(integer)
-------------------+--------------------------------+---------------
              1    |      "2015-07-27 21:35:24"     |         100
              1    |      "2015-07-27 21:35:24"     |         123
              2    |      "2015-07-27 21:35:24"     |         101
              2    |      "2015-07-27 21:35:24"     |         122
              3    |      "2015-07-27 21:35:24"     |         103
              3    |      "2015-07-27 21:35:24"     |         115
              1    |      "2015-07-26 21:35:24"     |         116
              1    |      "2015-07-26 21:35:24"     |         151
              2    |      "2015-07-26 21:35:24"     |         122
              2    |      "2015-07-26 21:35:24"     |         134
              3    |      "2015-07-26 21:35:24"     |         123
              3    |      "2015-07-26 21:35:24"     |         111
              1    |      "2015-07-25 21:35:24"     |         129
              1    |      "2015-07-25 21:35:24"     |         152
              2    |      "2015-07-25 21:35:24"     |         120
              2    |      "2015-07-25 21:35:24"     |         109
              3    |      "2015-07-25 21:35:24"     |         142
              3    |      "2015-07-25 21:35:24"     |         131

Expected results:

  user_id(integer) |        created_at(datetime)    | score(integer)
-------------------+--------------------------------+---------------
              1    |      "2015-07-27 21:35:24"     |         123
              2    |      "2015-07-27 21:35:24"     |         122
              3    |      "2015-07-27 21:35:24"     |         115
              1    |      "2015-07-26 21:35:24"     |         151
              2    |      "2015-07-26 21:35:24"     |         134
              3    |      "2015-07-26 21:35:24"     |         123
              1    |      "2015-07-25 21:35:24"     |         152
              2    |      "2015-07-25 21:35:24"     |         120
              3    |      "2015-07-25 21:35:24"     |         142

I've been combining various joins, having, and other methods, but to no avail. I can't get it to filter the results. I've made a little bit of progress by selecting max values per day, but then I can't filter out the lower ones on a per user_id basis. I managed to do it using multiple group_by and map in Rails, but it's super slow since it has to re-iterate over the entire array, and since there is a lot of records, it can take some time.

EDIT:

My solution was as follows:

    all_scores_in_time_period = UserScore
      .where("EXTRACT(MONTH FROM created_at) = ?", Date::MONTHNAMES.index(params[:month_control]))
      .where("EXTRACT(YEAR FROM created_at) = ?", params[:year_control])
      .select("DISTINCT ON(DATE(created_at), user_id) *")
      .order("DATE(created_at) desc")

This first filters by month/year, and then returns a list of users by their top score per user per day.


Solution

  • You can use distinct on () which is a Postgres extension to the distinct operator:

    select distinct on (user_id, created_at) user_id, created_at, score
    from the_table
    order by user_id, created_at, score desc;
    

    If you want solution using standard SQL, this can also be achieved using window functions:

    select user_id, created_at, score
    from (
      select user_id, created_at, score, 
             row_number() over (partition by user_id, created_at order by score desc) as rn
      from the_table
    ) as t
    order by user_id, created_at;
    

    The solution with distinct on() is typically faster in Postgres.

    With a window function you could also handle ties: when a user has the same (highest) score more than once on a day. The solution with row_number() will only return a single row per (user_id, created_at). If you want all rows with the same (highest) score, you need to use dense_rank() instead.

    Edit

    If you want to ignore the time part of the timestamp column, just cast it to a date:

    created_at::date