postgresql join subquery

Simplifying aggregation query using subqueries in join


We have one table that stores the total presence time for each (period, volunteer) pair, and another that logs actions. For each action we know which presence it was taken in, but we do not track the time spent on it. It is not a great data structure, but I cannot change it.

CREATE TABLE volunteer_presence (id integer, volunteer_id integer, minutes integer);
CREATE TABLE logged_actions (id integer, presence_id integer, section varchar);

What we would like to do is to approximate time spent on a given action by assuming time was split equally among all actions in a given presence. It's not great but we are happy to at least have that. For example let us say we have the following data:

INSERT INTO volunteer_presence (id, volunteer_id, minutes) values (1, 333, 400);
INSERT INTO volunteer_presence (id, volunteer_id, minutes) values (2, 444, 90);
INSERT INTO volunteer_presence (id, volunteer_id, minutes) values (3, 555, 80);

INSERT INTO logged_actions (id, presence_id, section) values (10, 1, 'Cats');
INSERT INTO logged_actions (id, presence_id, section) values (11, 1, 'Dogs');
INSERT INTO logged_actions (id, presence_id, section) values (13, 1, 'Cats');
INSERT INTO logged_actions (id, presence_id, section) values (14, 1, 'Cats');
INSERT INTO logged_actions (id, presence_id, section) values (15, 1, 'Front');
INSERT INTO logged_actions (id, presence_id, section) values (16, 2, 'Dogs');

We would expect the following result:

section  presence_id  estimated_minutes
Cats     1            240.0
Dogs     1            80.0
Front    1            80.0
Dogs     2            90.0

For example, Cats accounts for 3 of the 5 actions in presence 1, so its estimate is (3 / 5) * 400 = 240 minutes.
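
As a rough sanity check of that arithmetic, the sketch below evaluates the Cats share directly with literal values. The column aliases are made up for illustration. One thing to watch for: PostgreSQL truncates division between integers, so the count needs a cast to numeric before dividing.

```sql
-- Illustrative only: 3 Cats actions out of 5 total in a 400-minute presence.
SELECT 3 / 5 * 400          AS truncated_estimate,  -- integer division: 3 / 5 is 0
       3::numeric / 5 * 400 AS estimated_minutes;   -- numeric division: 0.6 * 400
```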

The solution I came up with feels wrong, with two subqueries in the join clauses, and I suspect it would not scale well.

SELECT  action_count_table.section,
        action_count_table.presence_id,
        ((action_count_table.total_action_count / presence_total_actions.action_count) * time_table.minutes) AS estimated_minutes
FROM (
  SELECT event.section AS section,
         volunteer_presence.id AS presence_id,
         SUM(event.action_count) AS total_action_count

  FROM (
    SELECT logged_actions.presence_id,
           logged_actions.section,
           COUNT(logged_actions.id) AS action_count
    FROM logged_actions
    GROUP BY logged_actions.presence_id, logged_actions.section
    ORDER BY logged_actions.presence_id, logged_actions.section
    ) AS event
  INNER JOIN volunteer_presence
  ON volunteer_presence.id = event.presence_id
  GROUP BY event.section, volunteer_presence.id
  ) AS action_count_table
INNER JOIN (
  SELECT DISTINCT logged_actions.presence_id AS presence_id,
                  volunteer_presence.minutes AS minutes
  FROM logged_actions
  INNER JOIN volunteer_presence
  ON volunteer_presence.id = logged_actions.presence_id
  ) AS time_table
ON action_count_table.presence_id = time_table.presence_id
INNER JOIN (
     SELECT logged_actions.presence_id,
           COUNT(logged_actions.id) AS action_count
    FROM logged_actions
    GROUP BY logged_actions.presence_id
    ORDER BY logged_actions.presence_id
) AS presence_total_actions
ON presence_total_actions.presence_id = time_table.presence_id

Is there a better solution that would scale better, or at least be simpler?


Solution

  • Step-by-step demo: db<>fiddle

    SELECT
        la.presence_id,
        la.section,
        vp.minutes * COUNT(*)::numeric / la.total AS estimated_minutes  -- 4 (cast avoids integer division)
    FROM (
        SELECT 
            *,
            COUNT(*) OVER (PARTITION BY presence_id) as total    -- 1
        FROM logged_actions
    ) la
    JOIN volunteer_presence vp ON vp.id = la.presence_id         -- 2
    GROUP BY la.presence_id, la.section, la.total, vp.minutes    -- 3
    ORDER BY la.presence_id, la.section
    
    1. The COUNT() window function adds the total count per presence_id to each record
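    2. Join volunteer_presence to get the minutes for each presence.
    3. Group by presence_id and section. total and minutes are constant within each group, so they are added to the GROUP BY to make them usable in the SELECT list.
    4. Calculate the count for each section per presence_id, divide it by the previously calculated total count per presence_id, and multiply by the minutes.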
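
A possible variant, not part of the answer above and offered only as a sketch: PostgreSQL evaluates window functions after GROUP BY, so the per-presence total can be taken as a window sum over the grouped counts, which removes the subquery. The alias estimated_minutes is borrowed from the question's expected output, and the query assumes the tables and sample data from the question.

```sql
SELECT
    la.presence_id,
    la.section,
    -- COUNT(*) is the action count per (presence_id, section) group;
    -- SUM(COUNT(*)) OVER (...) adds those counts up per presence_id.
    -- SUM over bigint returns numeric, so the division is not truncated.
    vp.minutes * COUNT(*) / SUM(COUNT(*)) OVER (PARTITION BY la.presence_id) AS estimated_minutes
FROM logged_actions la
JOIN volunteer_presence vp ON vp.id = la.presence_id
GROUP BY la.presence_id, la.section, vp.minutes
ORDER BY la.presence_id, la.section;
```

With the sample data this should give 240, 80 and 80 minutes for the three sections of presence 1 and 90 minutes for presence 2. Whether it actually performs better than the subquery form is worth checking with EXPLAIN ANALYZE on real data.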