We are using very simple SELECT … WHERE queries (1,000 rows from a very small table with 18,000 entries) to load time series data for plotting (for testing we run the same query X times). While single queries to the MySQL database are as fast as expected, multiple parallel queries are very slow (simplified Python code for the parallel queries below). When we increase the number of parallel queries by a factor of 10, the mysql.slow_log shows query times that are twice as long, but in Python the time for a single query increases five- to tenfold (0.04 seconds vs. 0.4 seconds).
Furthermore, we can see that the first queries are much faster than the last ones (0.06 seconds for the first query vs. 0.5 seconds for the last).
Obviously the table is fully indexed.
We also tried this with just a SELECT SLEEP(2) query and see a smaller but similar effect: with 20 of those, each one comes back after around 2.03 s; with 200, they take around 2.1 s.
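To rule out DataFrame construction as a factor, the same SLEEP test can be run against the raw DB-API driver without pandas. This is a minimal sketch, assuming PyMySQL and placeholder credentials:

from concurrent.futures import ThreadPoolExecutor
import time

import pymysql

def sleep_query():
    # Placeholder credentials; one dedicated connection per thread.
    con = pymysql.connect(host="host", user="user", password="password", database="schema")
    with con.cursor() as cur:
        start = time.time()
        cur.execute("SELECT SLEEP(2)")
        cur.fetchall()
        print(f"returned after {time.time() - start:.3f}s")
    con.close()

with ThreadPoolExecutor(max_workers=200) as pool:
    for _ in range(200):
        pool.submit(sleep_query)

If the per-query times still grow with the number of threads here, the overhead sits below pandas (driver, connection setup, or server).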
The InnoDB buffer pool seems to work fine, as MySQL shows 0 disk I/O reads when running the queries (all the data seems to be in memory). That means storage/IO is probably not the problem.
CPU usage is probably not the bottleneck either, as indicated by the CPU usage graphs in AWS RDS. Network usage does not seem to be the problem, because we see the effect on smaller queries as well.
We are using the AWS RDS MySQL service on a db.t3.large instance (2 vCPUs, 8 GB RAM, 5 Gbit/s network).
Code to reproduce the problem:
from concurrent.futures import ThreadPoolExecutor
import time

import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection URL; any SQLAlchemy MySQL URL works here.
engine = create_engine("mysql+pymysql://user:password@host/schema")

executor = ThreadPoolExecutor(max_workers=200)
futures = []

def read_db(query):
    con = engine.connect()
    start_single = time.time()
    print(start_single)  # start timestamp of this query
    result = pd.read_sql(query, con)
    print(f"query finished: {time.time() - start_single}s")
    con.close()
    return result

start_complete = time.time()
query = "SELECT * FROM `table` WHERE id_security = 1000;"
# query = "SELECT SLEEP(2)"

for i in range(100):
    future = executor.submit(read_db, query)
    futures.append(future)

for future in futures:  # wait for all results
    f = future.result()

print(f"time total: {time.time() - start_complete}s")
So the main questions are: Is there a better way to test whether the Futures/Threads overhead in Python is the problem, or whether it is the concurrent requests on the MySQL database itself? And is it right to assume that MySQL should easily be able to handle 100 concurrent queries that are served completely from the innodb_buffer_pool, without slowing down by a factor of 5 to 10?
Update:
Create Table Statement:
CREATE TABLE `table` (
`id_security` int(11) NOT NULL,
`id_field` int(11) NOT NULL,
`currency` varchar(3) DEFAULT NULL,
`date` date NOT NULL,
`timestamp` datetime DEFAULT CURRENT_TIMESTAMP,
`value` double DEFAULT NULL,
`source` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id_security`,`id_field`,`date`,`source`),
KEY `Dateindex` (`date`),
KEY `id_security_id_field` (`id_security`,`id_field`),
KEY `id_security` (`id_security`),
KEY `id_field` (`id_field`),
KEY `id_source` (`source`),
KEY `PRIMARY_WITHOUT_SOURCE` (`id_security`,`id_field`,`date`,`value`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
These are the queries we tried and ran EXPLAIN on (don't mind the different table name in production):
SELECT * FROM series_security WHERE id_security = 1002 AND id_field = 100 AND date>'2021-01-01';
SELECT * FROM series_security WHERE id_security = 1002 AND id_field = 100;
SELECT * FROM series_security WHERE id_security = 1002;
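For reference, the EXPLAIN output can be collected from Python the same way as the data itself. A minimal sketch, reusing the engine from the test script above:

queries = [
    "SELECT * FROM series_security WHERE id_security = 1002 AND id_field = 100 AND date > '2021-01-01'",
    "SELECT * FROM series_security WHERE id_security = 1002 AND id_field = 100",
    "SELECT * FROM series_security WHERE id_security = 1002",
]
for q in queries:
    # One plan row per table accessed; shows which key MySQL picked.
    print(pd.read_sql("EXPLAIN " + q, engine))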
The real table is 18,000,000 rows long; however, that does not slow down the requests meaningfully, so the indices seem to work. Nevertheless, we tried it on the smaller 18,000-row table to be sure.
Update 2: That is just a dummy table to test the issue. The many indexes are not in production; otherwise INSERT statements would probably be too slow.
The production table is created like that:
CREATE TABLE `table` (
`id_security` int(11) NOT NULL,
`id_field` int(11) NOT NULL,
`currency` varchar(3) DEFAULT NULL,
`date` date NOT NULL,
`timestamp` datetime DEFAULT CURRENT_TIMESTAMP,
`value` double DEFAULT NULL,
`source` int(11) NOT NULL DEFAULT '0',
PRIMARY KEY (`id_security`,`id_field`,`date`,`source`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
We played around with the indexes to get a feeling for the problem, but as long as the PRIMARY KEY is present, it makes no difference.
The id in the WHERE clause is a single integer.
innodb_buffer_pool_size is 3/4 of the memory, so 6 GB (default RDS setting). As we use the SAME query for all requests in this dummy example, the innodb_buffer_pool is hit 100% and we have zero IO reads.
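That the buffer pool really absorbs all reads can be double-checked from Python by sampling the server's status counters around the test run. A minimal sketch, reusing the engine from the script above (Innodb_buffer_pool_reads counts reads that had to hit disk, Innodb_buffer_pool_read_requests counts logical reads; note these counters are global, so other sessions contribute too):

def buffer_pool_counters():
    # Standard MySQL status counters; values arrive as strings.
    df = pd.read_sql("SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'", engine)
    return dict(zip(df["Variable_name"], df["Value"].astype(int)))

before = buffer_pool_counters()
# ... run the concurrent test here ...
after = buffer_pool_counters()

# If everything is served from memory, disk reads stay flat.
print("disk reads:", after["Innodb_buffer_pool_reads"] - before["Innodb_buffer_pool_reads"])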
Furthermore, even if we run a smaller query on a tiny dummy table, we see a similar effect.
In production, usually 500 to 20,000 rows are fetched with one query (time series data).
In this dummy example we tried up to 200 threads (see max_workers).
Regarding the production environment: the queries run with many different id_security values.
Update 3:
As requested:
SELECT COUNT(*) FROM information_schema.tables; https://pastebin.com/SRmKupdC -- 392 rows
SHOW GLOBAL STATUS; https://pastebin.com/6F8kjCX6
SHOW GLOBAL VARIABLES; https://pastebin.com/r5tmFX7d
SHOW FULL PROCESSLIST; (In Idle) https://pastebin.com/Q3DhEZxV -- Nothing exciting
SHOW FULL PROCESSLIST; (With some load on webserver) https://pastebin.com/4FK1FjEC -- One query, running for only 0 seconds:
SELECT date as d,
value as v
FROM datalake.series_security
WHERE id_security = 1008
AND date >= '1962-02-09 00:00:00'
AND date <= '2099-01-01 00:00:00'
AND id_field=100
ORDER BY date ASC
STATUS; https://pastebin.com/X5SL64XK -- mostly latin1
SHOW ENGINE INNODB STATUS; (Idle) https://pastebin.com/JPiqF76s --yawn.
SHOW ENGINE INNODB STATUS; (With some load on webserver) https://pastebin.com/GGsmzsXz -- essentially idle
We found that it has nothing to do with MySQL, but with Python's Global Interpreter Lock (GIL). Because the function read_db() / pd.read_sql(query, con) is CPU-bound and Python has the GIL, the query results are received and processed sequentially.
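A quick way to confirm this is to run the identical workload with different thread-pool sizes: if the Python side serializes the work, the total wall time barely improves with more workers. A minimal sketch, reusing read_db and query from the script above:

from concurrent.futures import ThreadPoolExecutor
import time

for workers in (1, 10, 100):
    start = time.time()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(read_db, [query] * 100))  # 100 identical queries
    print(f"{workers:3d} workers: {time.time() - start:.2f}s total")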
One solution is to use multiprocessing instead of multithreading: one can simply exchange the ThreadPoolExecutor for the ProcessPoolExecutor from concurrent.futures.
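A minimal sketch of that swap (the connection URL is a placeholder; note that SQLAlchemy engines and connections cannot be pickled, so each worker process creates its own):

from concurrent.futures import ProcessPoolExecutor

import pandas as pd
from sqlalchemy import create_engine

DB_URL = "mysql+pymysql://user:password@host/schema"  # placeholder

def read_db_process(query):
    # Each process builds its own engine; connections cannot cross process boundaries.
    engine = create_engine(DB_URL)
    try:
        return pd.read_sql(query, engine)
    finally:
        engine.dispose()

if __name__ == "__main__":
    query = "SELECT * FROM `table` WHERE id_security = 1000;"
    with ProcessPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(read_db_process, [query] * 100))

The resulting DataFrames are pickled back to the parent process, so very large results add serialization overhead; but the per-query parsing now runs on separate cores instead of being serialized by the GIL.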