Search code examples
sqlmysqlquery-optimization

fastest way for getting concurrent entries in large database


i need to get concurrent sessions for every minute of day from database with ~2 million rows. what is the best(fastest) way to get that data?

table contains session_id, session_start_time, session_end_time

example

CREATE TABLE `session_id_table` (
  `session_id` bigint NOT NULL,
  `session_start_time` datetime DEFAULT NULL,
  `session_end_time` datetime DEFAULT NULL,
  PRIMARY KEY (`session_id`),
  UNIQUE KEY `session_id_UNIQUE` (`session_id`),
  KEY `idx_session_id_table_session_id` (`session_id`),
  KEY `time_1` (`session_start_time`,`session_end_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3 COLLATE=utf8mb3_bin;


insert into session_id_table(session_id, session_start_time, session_end_time) values 
(1, '2024-09-05 10:00:23', '2024-09-05 10:15:54'), 
(2, '2024-09-05 10:00:29', '2024-09-05 10:22:54'), 
(3, '2024-09-05 10:01:23', '2024-09-05 10:26:54'), 
(4, '2024-09-05 10:02:20', '2024-09-05 10:21:54'), 
(5, '2024-09-05 10:02:23', '2024-09-05 10:20:54'),
(6, '2024-09-05 10:21:00', '2024-09-05 10:35:54') 

result should be

'2024-09-05 10:01:00', 1 
'2024-09-05 10:01:00', 2 
'2024-09-05 10:02:00', 1
'2024-09-05 10:02:00', 2
'2024-09-05 10:02:00', 3
'2024-09-05 10:21:00', 2
'2024-09-05 10:21:00', 3
'2024-09-05 10:21:00', 4
'2024-09-05 10:21:00', 6

so for every minute of day it needs to show active sessions at that time.

first i tried to do join for small part of data for 2 days (~30k rows per day, 300-700 session_ids per minute)

select distinct date_format(a.session_start_time, '%Y-%c-%e %H:%i'), b.session_id 
from session_id_table as a
left join 
session_id_table as b
on date_format(a.session_start_time, '%Y-%c-%e %H:%i')>=b.session_start_time and date_format(a.session_start_time, '%Y-%c-%e %H:%i')<=b.session_end_time
and date(a.session_start_time)>'2024-09-05'

but query didnt execute even after 3 hours


Solution

  • wanted to do answer this 3 days ago, but when benchmarking i noticed weird results(that i posted in another thread).

    anyways fastest way to get concurrent session id-s is through comparing timestamps that are in another table. comparing datetimes on join gives 20-35% slower results than with timestamp. indexing data that is used on join argument seems to slow down everything not on the edges of database(added first or added last).

    select dt, session_id from datetime_timestamp as a
    inner join session_id_table as b 
    on (dt>=session_start_time) and (dt<=session_end_time)
    #where dt>='2023-05-25 00:00:00' and dt<='2023-05-25 01:00:00'
    where dt='2023-10-25 00:00:00' and dt<='2023-10-25 01:00:00'
    #where dt>='2023-12-10 00:00:00' and dt<='2023-12-10 01:00:00'
    #where dt>='2024-09-10 00:00:00' and dt<='2024-09-10 01:00:00'
    #where dt>='2024-09-20 12:00:00' and dt<='2024-09-20 13:00:00'
    

    timestamp without index compared to same query with datetime instead of timestamps datetime without index

    if we use CTE with timestamps:

    with 
    dates as
    (select distinct convert(date_format(session_start_time, '%Y-%c-%e %H:%i'), datetime) as date_time, unix_timestamp(convert(date_format(session_start_time, '%Y-%c-%e %H:%i'), datetime)) as ts
    from session_id_table)
    select date_time, session_id from dates as a
    inner join session_id_table as b 
    on (ts>=start_ts) and (ts<=end_ts)
    #where date_time>='2023-05-25 00:00:00' and date_time<='2023-05-25 01:00:00'
    where date_time>='2023-10-25 00:00:00' and date_time<='2023-10-25 01:00:00'
    #where date_time>='2023-12-10 00:00:00' and date_time<='2023-12-10 01:00:00'
    #where date_time>='2024-09-10 00:00:00' and date_time<='2024-09-10 01:00:00'
    #where date_time>='2024-09-19 00:00:00' and date_time<='2024-09-19 01:00:00'
    

    we get these execution times

    timestamps with CTE

    if we run same CTE query but just have join on datetime instead of timestamps we get these results: datetime with CTE

    when i started this thread with python+sql script i needed to run ~55 seconds for every hour of data in database, but i could run this concurrently for faster execution.

    with answer from d r i got ~5 seconds execution time when i tested it, giving the query significant improvement, but i didnt test this extensively.

    2 weeks after, i decided to benchmark new query after learning where i failed in OP, saw weird results, run d r query saw weird results, so i created new thread about that weirdness. overall indexes gave super fast results at beginning of the table, but much slower results at anything not at edge, so d r answer was slower than python+sql solution that i already had for over 90% of the table.