I have a system which calculates the work time of a user on a specific task. The main issue is that the tasks take a variable amount of time to complete. Some take a few hours and some take a few days or even a week.
Each task has a start and a stop time as well as pauses in between:
+------+---------------------+---------------------+
| _id | _start_date | _finish_date |
+------+---------------------+---------------------+
| 5013 | 2021-06-28 15:23:00 | 2021-06-29 09:25:01 |
| 5020 | 2021-06-28 15:33:00 | 2021-06-28 15:35:57 |
| 5025 | 2021-06-28 15:41:00 | 2021-07-06 07:33:24 |
+------+---------------------+---------------------+
Each task can have one, several or no pauses between the start and finish date:
+------+---------+---------------------+---------------------+
| _id | _job_id | _pause_start | _pause_end |
+------+---------+---------------------+---------------------+
| 1690 | 5013 | 2021-06-28 15:26:00 | 2021-06-29 09:24:48 |
| 1700 | 5025 | 2021-06-28 16:31:00 | 2021-06-29 10:52:56 |
| 1723 | 5025 | 2021-06-29 10:54:00 | 2021-07-06 07:32:59 |
+------+---------+---------------------+---------------------+
The main issue occurs when doing analytics. Lets say we have a project who has been worked on for 3 days long (20th to 22nd of July). If I generate a report for the 21st, I only want to get the repair time on the 21st which means that I'd have to calculate the time between the pauses for that day.
So I have come up with a solution - create a view which contains daily repair time of each task:
+---------+------------+--------------+
| _job_id | _date | _repair_time |
+---------+------------+--------------+
| 123 | 2022-07-20 | 25 |
| 123 | 2022-07-21 | 180 |
| 123 | 2022-07-22 | 45 |
+---------+------------+--------------+
So that when I run the report for the 20th or 21st or 22nd of July, I'd get the repair time for that day.
Is this possible to accomplish given my table structure?
I'll give it a try and create VIEWs for each step, to make things more clear. In the end I will give an all-in-one query using CTEs, so that the real answer can be found down at step #5.
Preparation:
CREATE TABLE tasks (
id bigint NOT NULL AUTO_INCREMENT PRIMARY KEY,
task_start datetime,
task_end datetime
);
CREATE TABLE pauses (
id bigint NOT NULL AUTO_INCREMENT PRIMARY KEY,
task_id bigint NOT NULL REFERENCES tasks(id),
pause_start datetime,
pause_end datetime
);
INSERT INTO tasks(task_start, task_end) VALUES
('2022-07-02 07:45:01', NULL),
('2022-07-02 13:32:17', '2022-07-10 14:23:28');
INSERT INTO pauses(task_id, pause_start, pause_end) VALUES
(1, '2022-07-02 19:00:00', '2022-07-03 07:00:00'),
(2, '2022-07-05 14:00:00', '2022-07-05 14:05:00'),
(2, '2022-07-08 19:00:00', '2022-07-09 07:00:00');
Step #1: Transform anything to a series of events. Some task or pauses may have started but not yet ended, so fake dates are used:
CREATE OR REPLACE VIEW task_events AS
SELECT * FROM (
SELECT id AS task_id, 'task_start' AS event_type, task_start as event_time FROM tasks
UNION ALL
SELECT id, 'task_end', coalesce(task_end, date_add(task_start, INTERVAL 10 YEAR)) FROM tasks
UNION ALL
SELECT task_id, 'pause_start', pause_start FROM pauses
UNION ALL
SELECT task_id, 'pause_end', coalesce(pause_end, date_add(pause_start, INTERVAL 10 YEAR)) FROM pauses
) e
ORDER BY event_time;
Step #2: Transform the events back to periods:
CREATE OR REPLACE VIEW event_periods AS
SELECT e.*,
lag(event_type) OVER (PARTITION BY task_id ORDER BY event_time DESC) AS next_event_type,
lag(event_time) OVER (PARTITION BY task_id ORDER BY event_time DESC) AS next_event_time
FROM task_events e;
Step #3: Extract only those periods which started some work, that is whenever a task started or a pause ended.
CREATE OR REPLACE VIEW working_periods AS
SELECT * FROM event_periods WHERE event_type IN ('task_start', 'pause_end')
ORDER BY task_id, event_time;
Step #4: Use the working_periods for reporting. E. g. to get the working time between 2022-07-09 and 2022-07-11:
SELECT p.*, timediff(next_event_time, event_time) AS working_time
FROM (SELECT task_id,
event_type, greatest(event_time, '2022-07-09') AS event_time,
next_event_type, least(next_event_time, '2022-07-11') AS next_event_time
FROM working_periods
WHERE next_event_time >= '2022-07-09' and event_time <= '2022-07-11') p;
Step #5: rewrite using CTEs
WITH task_events AS (
SELECT id AS task_id, 'task_start' AS event_type, task_start as event_time FROM tasks
UNION ALL
SELECT id, 'task_end', coalesce(task_end, date_add(task_start, INTERVAL 10 YEAR)) FROM tasks
UNION ALL
SELECT task_id, 'pause_start', pause_start FROM pauses
UNION ALL
SELECT task_id, 'pause_end', coalesce(pause_end, date_add(pause_start, INTERVAL 10 YEAR)) FROM pauses
),
event_periods AS (
SELECT e.*,
lag(event_type) OVER (PARTITION BY task_id ORDER BY event_time DESC) AS next_event_type,
lag(event_time) OVER (PARTITION BY task_id ORDER BY event_time DESC) AS next_event_time
FROM task_events e
),
working_periods AS (
SELECT *
FROM event_periods WHERE event_type IN ('task_start', 'pause_end')
ORDER BY task_id, event_time
),
report_periods AS (
SELECT task_id,
event_type, greatest(event_time, '2022-07-09') AS event_time,
next_event_type, least(next_event_time, '2022-07-11') AS next_event_time
FROM working_periods
WHERE next_event_time >= '2022-07-09' and event_time <= '2022-07-11'
)
SELECT p.*, timediff(next_event_time, event_time) AS working_time
FROM report_periods p;
Result:
task_id | event_type | event_time | next_event_type | next_event_time | working_time |
---|---|---|---|---|---|
1 | pause_end | 2022-07-09 00:00:00 | task_end | 2022-07-11 00:00:00 | 48:00:00 |
2 | pause_end | 2022-07-09 07:00:00 | task_end | 2022-07-10 14:23:28 | 31:23:28 |
I'm pretty sure, it's far from being perfect, but perhaps it can serve as a starting point.