Search code examples
mysqlviewdata-analysis

How to create a MYSQL View of task time per day given start date, finish date and pauses in between


I have a system which calculates the work time of a user on a specific task. The main issue is that the tasks take a variable amount of time to complete. Some take a few hours and some take a few days or even a week.

Each task has a start and a stop time as well as pauses in between:

+------+---------------------+---------------------+
| _id  | _start_date         | _finish_date        |
+------+---------------------+---------------------+
| 5013 | 2021-06-28 15:23:00 | 2021-06-29 09:25:01 |
| 5020 | 2021-06-28 15:33:00 | 2021-06-28 15:35:57 |
| 5025 | 2021-06-28 15:41:00 | 2021-07-06 07:33:24 |
+------+---------------------+---------------------+

Each task can have one, several or no pauses between the start and finish date:

+------+---------+---------------------+---------------------+
| _id  | _job_id | _pause_start        | _pause_end          |
+------+---------+---------------------+---------------------+
| 1690 |    5013 | 2021-06-28 15:26:00 | 2021-06-29 09:24:48 |
| 1700 |    5025 | 2021-06-28 16:31:00 | 2021-06-29 10:52:56 |
| 1723 |    5025 | 2021-06-29 10:54:00 | 2021-07-06 07:32:59 |
+------+---------+---------------------+---------------------+

The main issue occurs when doing analytics. Lets say we have a project who has been worked on for 3 days long (20th to 22nd of July). If I generate a report for the 21st, I only want to get the repair time on the 21st which means that I'd have to calculate the time between the pauses for that day.

So I have come up with a solution - create a view which contains daily repair time of each task:

+---------+------------+--------------+
| _job_id | _date      | _repair_time |
+---------+------------+--------------+
|     123 | 2022-07-20 |           25 |
|     123 | 2022-07-21 |          180 |
|     123 | 2022-07-22 |           45 |
+---------+------------+--------------+

So that when I run the report for the 20th or 21st or 22nd of July, I'd get the repair time for that day.

Is this possible to accomplish given my table structure?


Solution

  • I'll give it a try and create VIEWs for each step, to make things more clear. In the end I will give an all-in-one query using CTEs, so that the real answer can be found down at step #5.

    Preparation:

    CREATE TABLE tasks (
      id bigint NOT NULL AUTO_INCREMENT PRIMARY KEY,
      task_start datetime,
      task_end datetime
    );
    
    CREATE TABLE pauses (
      id bigint NOT NULL AUTO_INCREMENT PRIMARY KEY,
      task_id bigint NOT NULL REFERENCES tasks(id),
      pause_start datetime,
      pause_end datetime
    );
    
    INSERT INTO tasks(task_start, task_end) VALUES
      ('2022-07-02 07:45:01', NULL),
      ('2022-07-02 13:32:17', '2022-07-10 14:23:28');
    
    INSERT INTO pauses(task_id, pause_start, pause_end) VALUES
      (1, '2022-07-02 19:00:00', '2022-07-03 07:00:00'),
      (2, '2022-07-05 14:00:00', '2022-07-05 14:05:00'),
      (2, '2022-07-08 19:00:00', '2022-07-09 07:00:00');
    

    Step #1: Transform anything to a series of events. Some task or pauses may have started but not yet ended, so fake dates are used:

    CREATE OR REPLACE VIEW task_events AS 
    SELECT * FROM (  
        SELECT id AS task_id, 'task_start' AS event_type, task_start as event_time FROM tasks
        UNION ALL
        SELECT id, 'task_end', coalesce(task_end, date_add(task_start, INTERVAL 10 YEAR)) FROM tasks
        UNION ALL
        SELECT task_id, 'pause_start', pause_start FROM pauses
        UNION ALL
        SELECT task_id, 'pause_end', coalesce(pause_end, date_add(pause_start, INTERVAL 10 YEAR)) FROM pauses
    ) e
    ORDER BY event_time;
    

    Step #2: Transform the events back to periods:

    CREATE OR REPLACE VIEW event_periods AS
    SELECT e.*,
      lag(event_type) OVER (PARTITION BY task_id ORDER BY event_time DESC) AS next_event_type,
      lag(event_time) OVER (PARTITION BY task_id ORDER BY event_time DESC) AS next_event_time
    FROM task_events e;
    

    Step #3: Extract only those periods which started some work, that is whenever a task started or a pause ended.

    CREATE OR REPLACE VIEW working_periods AS
    SELECT * FROM event_periods WHERE event_type IN ('task_start', 'pause_end')
    ORDER BY task_id, event_time;
    

    Step #4: Use the working_periods for reporting. E. g. to get the working time between 2022-07-09 and 2022-07-11:

    SELECT p.*, timediff(next_event_time, event_time) AS working_time 
      FROM (SELECT task_id,
                   event_type, greatest(event_time, '2022-07-09') AS event_time,
                   next_event_type, least(next_event_time, '2022-07-11') AS next_event_time
              FROM working_periods
             WHERE next_event_time >= '2022-07-09' and event_time <= '2022-07-11') p;
    

    Step #5: rewrite using CTEs

    WITH task_events AS (  
            SELECT id AS task_id, 'task_start' AS event_type, task_start as event_time FROM tasks
            UNION ALL
            SELECT id, 'task_end', coalesce(task_end, date_add(task_start, INTERVAL 10 YEAR)) FROM tasks
            UNION ALL
            SELECT task_id, 'pause_start', pause_start FROM pauses
            UNION ALL
            SELECT task_id, 'pause_end', coalesce(pause_end, date_add(pause_start, INTERVAL 10 YEAR)) FROM pauses
        ),
        event_periods AS (
            SELECT e.*,
              lag(event_type) OVER (PARTITION BY task_id ORDER BY event_time DESC) AS next_event_type,
              lag(event_time) OVER (PARTITION BY task_id ORDER BY event_time DESC) AS next_event_time
            FROM task_events e
        ),
        working_periods AS (
            SELECT * 
              FROM event_periods WHERE event_type IN ('task_start', 'pause_end')
             ORDER BY task_id, event_time
        ),
        report_periods AS (
            SELECT task_id,
                   event_type, greatest(event_time, '2022-07-09') AS event_time,
                   next_event_type, least(next_event_time, '2022-07-11') AS next_event_time
              FROM working_periods
             WHERE next_event_time >= '2022-07-09' and event_time <= '2022-07-11'
        )
    SELECT p.*, timediff(next_event_time, event_time) AS working_time
      FROM report_periods p;
    

    Result:

    task_id event_type event_time next_event_type next_event_time working_time
    1 pause_end 2022-07-09 00:00:00 task_end 2022-07-11 00:00:00 48:00:00
    2 pause_end 2022-07-09 07:00:00 task_end 2022-07-10 14:23:28 31:23:28

    I'm pretty sure, it's far from being perfect, but perhaps it can serve as a starting point.