I am trying to work out the best way that I can find out in which times during the day had the most users in the chat room.
I tried grouping the dates, but it doesn't really work because of the intervals. Is there some sort of window function I can use in either MySQL? Could maybe moving this data into ClickHouse be a more effective way of getting this information?
There are ways that you can find out the total amount of overlapping time ranges - ie, the amount of collisions that happen between enter_time
and leave_time
(see: How to find maximum time range collision occurencies in Mysql) but this does not allow you to find the exact times when there were the most users in the chat room.
A picture describing the overlay of times for different users
What would be the most effective way to build a query that would show the times with the most users was
busy_time_start | busy_time_end
--------------------------------------------
2022-09-10T03:10:00Z | 2022-09-10T05:59:00Z
2022-09-10T06:05:00Z | 2022-09-10T09:59:00Z
I have a table that has a list of users that enter a chat room. It describes the user_id, enter_time, and leave_time.
What I would like to try and work out is what are the duration(s) that had the most amount of people in the chat room for a given day.
MySQL Schema Setup:
CREATE TABLE `user_log` (
`user_id` int DEFAULT NULL,
`enter_time` datetime DEFAULT NULL,
`leave_time` datetime DEFAULT NULL);
insert into user_log values
(1, '2022-09-10 01:00:00', '2022-09-10 12:00:00'),
(2, '2022-09-10 02:10:00', '2022-09-10 11:59:00'),
(3, '2022-09-10 03:30:00', '2022-09-10 10:59:00'),
(4, '2022-09-10 04:10:00', '2022-09-10 09:59:00'),
(5, '2022-09-10 02:05:00', '2022-09-10 05:59:00'),
(6, '2022-09-10 06:05:00', '2022-09-10 08:59:00');
Query 1:
select * from user_log
| user_id | enter_time | leave_time |
|---------|----------------------|----------------------|
| 1 | 2022-09-10T01:00:00Z | 2022-09-10T12:00:00Z |
| 2 | 2022-09-10T02:10:00Z | 2022-09-10T11:59:00Z |
| 3 | 2022-09-10T03:30:00Z | 2022-09-10T10:59:00Z |
| 4 | 2022-09-10T04:10:00Z | 2022-09-10T09:59:00Z |
| 5 | 2022-09-10T02:05:00Z | 2022-09-10T05:59:00Z |
| 6 | 2022-09-10T06:05:00Z | 2022-09-10T08:59:00Z |
First we find all the times where user count could change. It looks like you are always rounding to the nearest minute and your end times are inclusive (up through that time, not up to but not through that time). So:
select enter_time as time from user_log
union distinct
select leave_time + interval 1 minute from user_log
For each of those times, find the count of users and the end time (using lead):
select
boundaries.time as busy_time_start,
count(user_log.user_id) as user_count,
lead(boundaries.time) over (order by boundaries.time) - interval 1 minute as busy_time_end
from (
select enter_time as time from user_log
union distinct
select leave_time + interval 1 minute from user_log
) boundaries
left join user_log on boundaries.time between user_log.enter_time and user_log.leave_time
group by boundaries.time
Now we just need to narrow it down to the rows with the maximum user count:
select busy_time_start, busy_time_end
from (
select
boundaries.time as busy_time_start,
count(user_log.user_id) as user_count,
lead(boundaries.time) over (order by boundaries.time) - interval 1 minute as busy_time_end,
max(count(user_log.user_id)) over () as max_user_count
from (
select enter_time as time from user_log
union distinct
select leave_time + interval 1 minute from user_log
) boundaries
left join user_log on boundaries.time between user_log.enter_time and user_log.leave_time
group by boundaries.time
) busy_times
where user_count = max_user_count
If you have cases where a user leaves and another user enters the following minute, this could produce adjacent ranges that could be merged together; to do that you'd treat this as a gaps and islands problem and group by each range of rows with the same count, selecting the minimum start time and maximum end time for each range.
The join to user_log will be O(n^2); if you have some known maximum time a user will be in the chat for and that is much less than the total time range of your data, you can index user_log on enter time and add an enter_time range to the on clause.