Search code examples
mysqldateactivitylog

Finding the times that most people were in a chat room


I am trying to work out the best way that I can find out in which times during the day had the most users in the chat room.

I tried grouping the dates, but it doesn't really work because of the intervals. Is there some sort of window function I can use in either MySQL? Could maybe moving this data into ClickHouse be a more effective way of getting this information?

There are ways that you can find out the total amount of overlapping time ranges - ie, the amount of collisions that happen between enter_time and leave_time (see: How to find maximum time range collision occurencies in Mysql) but this does not allow you to find the exact times when there were the most users in the chat room.

A picture describing the overlay of times for different users

What would be the most effective way to build a query that would show the times with the most users was

The outcome that I am trying to achieve

busy_time_start      | busy_time_end
--------------------------------------------
2022-09-10T03:10:00Z | 2022-09-10T05:59:00Z
2022-09-10T06:05:00Z | 2022-09-10T09:59:00Z

I have a table that has a list of users that enter a chat room. It describes the user_id, enter_time, and leave_time.

What I would like to try and work out is what are the duration(s) that had the most amount of people in the chat room for a given day.

What does my data look like?

MySQL Schema Setup:

CREATE TABLE `user_log` (
  `user_id` int DEFAULT NULL,
  `enter_time` datetime DEFAULT NULL,
  `leave_time` datetime DEFAULT NULL);
  
insert into user_log values
(1, '2022-09-10 01:00:00', '2022-09-10 12:00:00'),
(2, '2022-09-10 02:10:00', '2022-09-10 11:59:00'),
(3, '2022-09-10 03:30:00', '2022-09-10 10:59:00'),
(4, '2022-09-10 04:10:00', '2022-09-10 09:59:00'),
(5, '2022-09-10 02:05:00', '2022-09-10 05:59:00'),
(6, '2022-09-10 06:05:00', '2022-09-10 08:59:00');

Query 1:

select * from user_log

Results:

| user_id |           enter_time |           leave_time |
|---------|----------------------|----------------------|
|       1 | 2022-09-10T01:00:00Z | 2022-09-10T12:00:00Z |
|       2 | 2022-09-10T02:10:00Z | 2022-09-10T11:59:00Z |
|       3 | 2022-09-10T03:30:00Z | 2022-09-10T10:59:00Z |
|       4 | 2022-09-10T04:10:00Z | 2022-09-10T09:59:00Z |
|       5 | 2022-09-10T02:05:00Z | 2022-09-10T05:59:00Z |
|       6 | 2022-09-10T06:05:00Z | 2022-09-10T08:59:00Z |

Solution

  • First we find all the times where user count could change. It looks like you are always rounding to the nearest minute and your end times are inclusive (up through that time, not up to but not through that time). So:

    select enter_time as time from user_log
    union distinct
    select leave_time + interval 1 minute from user_log
    

    For each of those times, find the count of users and the end time (using lead):

    select
        boundaries.time as busy_time_start,
        count(user_log.user_id) as user_count,
        lead(boundaries.time) over (order by boundaries.time) - interval 1 minute as busy_time_end
    from (
        select enter_time as time from user_log
        union distinct
        select leave_time + interval 1 minute from user_log
    ) boundaries
    left join user_log on boundaries.time between user_log.enter_time and user_log.leave_time
    group by boundaries.time
    

    Now we just need to narrow it down to the rows with the maximum user count:

    select busy_time_start, busy_time_end
    from (
        select
            boundaries.time as busy_time_start,
            count(user_log.user_id) as user_count,
            lead(boundaries.time) over (order by boundaries.time) - interval 1 minute as busy_time_end,
            max(count(user_log.user_id)) over () as max_user_count
        from (
            select enter_time as time from user_log
            union distinct
            select leave_time + interval 1 minute from user_log
        ) boundaries
        left join user_log on boundaries.time between user_log.enter_time and user_log.leave_time
        group by boundaries.time
    ) busy_times
    where user_count = max_user_count
    

    If you have cases where a user leaves and another user enters the following minute, this could produce adjacent ranges that could be merged together; to do that you'd treat this as a gaps and islands problem and group by each range of rows with the same count, selecting the minimum start time and maximum end time for each range.

    The join to user_log will be O(n^2); if you have some known maximum time a user will be in the chat for and that is much less than the total time range of your data, you can index user_log on enter time and add an enter_time range to the on clause.