Search code examples
sqlpostgresqlwindow-functionsstar-schema

Postgresql - window functions within window functions


Facing a query design issue and not sure about whether my approach to the problem is unnecessarily complicated:

I have a fact table:

       Column   |            Type             |                       Modifiers                       
------------+-----------------------------+-------------------------------------------------------
 id         | integer                     | not null default nextval('messages_id_seq'::regclass)
 type       | character varying(255)      | 
 ts         | numeric                     | 
 text       | text                        | 
 score      | double precision            | 
 user_id    | integer                     | 
 channel_id | integer                     | 
 time_id    | integer                     | 
 created_at | timestamp without time zone | 
 updated_at | timestamp without time zone | 

I am running some analytical queries against it currently, one of which (for example) would be:

  with intervals as (
  select 
    (select '09/27/2014'::date) + (n      || ' minutes')::interval start_time,
    (select '09/27/2014'::date) + ((n+60) || ' minutes')::interval end_time
      from generate_series(0, (24*60*7), 60 * 4) n
  )
  select 
    extract(epoch from i.start_time)::numeric * 1000 as ts, 
    extract(epoch from i.end_time)::numeric * 1000 as end_ts,
    sum(avg(messages.score)) over (order by i.start_time) as score

  from messages
  right join intervals i
    on messages.timestamp >= i.start_time and messages.timestamp < i.end_time

  where messages.timestamp between '09/27/2014' and '10/04/2014'

  group by i.start_time, i.end_time 
  order by i.start_time

As you guys can probably tell - this query computes the average of the "score" attribute for messages for a given time-bucket distribution and then alongside that computes a cumulative across the buckets (using the window).

What I am trying to do next is find the top 5 (for example) messages.text that are closest to the average for each bucket.

Right now, the only plan I have is to:

1) Join messages with the time-buckets
2) Compute a score - avg(score) over (partition by start_time) as deviation and save it against each record of the joined relation
3) Compute a rank() over (order by deviation) as rank
4) Select where rank between 1 and 5

The reason I have put this down imperatively in steps in because my first attempt at coming up with a design involved using a window function within a window function (rank() over (partition by start_time, order by score - avg(score) over (partition by start_time)) and I wasn't even going to attempt that to see if it would work.

Can I please get some advice about whether I'm headed in the right direction?


Solution

  • Whelp - here is what I have and seems to work:

    Now open for criticism is the structuring of, performance optimizations of and redundancy in my query! ^_^ (minus generating the timeseries directly instead of all the twisted intervals math which I will fix eventually!)

    with intervals as (
        select 
            (select '09/29/2014'::date) + (n      || ' minutes')::interval start_time,
            (select '09/29/2014'::date) + ((n+60) || ' minutes')::interval end_time
            from generate_series(0, (24*60*7), 60 * 4) n
    ), intervaled_messages as (
        select
            extract(epoch from i.start_time)::numeric * 1000 as ts, 
            extract(epoch from i.end_time)::numeric * 1000 as end_ts,
            abs(score - avg(score) over (partition by i.start_time)) as deviation
        from messages
        right join intervals i
            on messages.timestamp >= i.start_time and messages.timestamp < i.end_time
        where messages.timestamp between '09/29/2014' and '10/06/2014'
    ), ranked_messages as (
        select ts, end_ts, deviation, 
        rank() over (partition by ts order by deviation) as rank,
        row_number() over (partition by ts order by deviation) as row_number
        from intervaled_messages
    )
    select ts, end_ts, deviation, rank 
    from ranked_messages 
    where rank between 1 and 5
      and row_number between 1 and 5
    order by ts;