python mysql database concurrency amazon-rds

MySQL slow concurrent/parallel queries in Python

We are using very simple SELECT … WHERE queries (1000 rows, from a very small table with 18000 entries) to load time series data for plotting (for testing we are running the same query X times). While single queries to the MySQL database are fast as expected, multiple parallel queries are very slow (simplified python code for parallel queries below). When increasing the amount of parallel queries by a factor of 10 the mysql.slow_log shows query times that are twice as long, but in python the time for a single query fivefolds to tenfolds (0.04 seconds vs 0.4 second).

Furthermore, we can see that the first queries are much faster than the last ones (0.5 seconds for the last query vs. 0.06 seconds for the first query)

Obviously the table is fully indexed.

We also tried this with just a SLEEP(2) query and are getting a smaller but similar effect: For 20 of those each one comes back after around 2.03s, for 200 they take around 2.1s.

The innodb buffer seems to work fine as MySQL shows 0 Disk/IO reads when running the queries (all the data seems to be in memory). That means storage/IO is probably not the Problem.

CPU usage is probably not the bottleneck as indicated by the CPU usage graphs in AWS RDS. Network usage seems not to be the problem as well, because we see the effect also on smaller queries.

We are using the AWS RDS MySQL service on an db.t3.large instance (2 vCPU, 8GB RAM, 5GBit/s Network).

Code to reproduce the problem:

from concurrent.futures import ThreadPoolExecutor 
import pandas as pd 
import time 

executor = ThreadPoolExecutor(max_workers=200) 
futures = [] 

def read_db(query): 
    con = engine.connect() 
    start_single_query = time.time() 
    print(start_single) 
    result = pd.read_sql(query, con) 
    print(f"query finished: {time.time() - start_single}s")     
    con.close() 
    return result 

start_complete = time.time() 
query = "SELECT * FROM table WHERE id_security = 1000;" 
# query = "SELECT SLEEP(2)" 

for i in range(100):  
    future = executor.submit(read_db, query) 
    futures.append(future) 

for future in futures: # wait for all results 
    f=future.result() 
 
print(f"time total: {time.time() - start_complete}s")

So the main questions are:

Why are the concurrent queries slower, even though we seem to not have any IO/CPU/Network bottlenecks?
Why is the query_time in the Slow Query log not similar to the query time, we can measure in Python?
Is this a sensible way to parallelize the queries in Python?

Update:

Create Table Statement:

CREATE TABLE `table` (
  `id_security` int(11) NOT NULL,
  `id_field` int(11) NOT NULL,
  `currency` varchar(3) DEFAULT NULL,
  `date` date NOT NULL,
  `timestamp` datetime DEFAULT CURRENT_TIMESTAMP,
  `value` double DEFAULT NULL,
  `source` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id_security`,`id_field`,`date`,`source`),
  KEY `Dateindex` (`date`),
  KEY `id_security_id_field` (`id_security`,`id_field`),
  KEY `id_security` (`id_security`),
  KEY `id_field` (`id_field`),
  KEY `id_source` (`source`),
  KEY `PRIMARY_WITHOUT_SOURCE` (`id_security`,`id_field`,`date`,`value`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

These are the tried queries and their Explain statements (Don't mind the different table name in production): SELECT * FROM series_security WHERE id_security = 1002 AND id_field = 100 AND date>'2021-01-01';

SELECT * FROM series_security WHERE id_security = 1002 AND id_field = 100;

SELECT * FROM series_security WHERE id_security = 1002

The real table is 18.000.000 lines long, however that does not slow down the requests meaningfully. So the indices seem to work. Nevertheless, we tried it on the smaller 18.000 row table to be sure.

Update 2: That is just a dummy table to test the issue. The many indexes are not in production, otherwise Insert statements would be probably too slow.

The production table is created like that:

CREATE TABLE `table` (
  `id_security` int(11) NOT NULL,
  `id_field` int(11) NOT NULL,
  `currency` varchar(3) DEFAULT NULL,
  `date` date NOT NULL,
  `timestamp` datetime DEFAULT CURRENT_TIMESTAMP,
  `value` double DEFAULT NULL,
  `source` int(11) NOT NULL DEFAULT '0',
  PRIMARY KEY (`id_security`,`id_field`,`date`,`source`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

We played around with the indexes to get a feeling for the problem. But as long as the PRIMARY KEY is in there, it doesn't make a difference.

The id in the WHERE clause is a single integer.

innodb_buffer_pool_size is 3/4 of the memory, so 6GB (default RDS setting)

As we use the SAME query for all requests in this dummy example, innodb_buffer_pool is hit 100% and we have zero IO reads.

Futhermore even if we do a smaller query on a tiny dummy table, we see a similar effect.

In Production usually 500 to 20,000 lines are fetched with one query (Timeseries data).

In this dummy example we tried up to 200 threads (s. max_worker)

Regarding the production environment:

We are using the Flask framework
The server runs on AWS Beanstalk and we allow up to 200 Threads on the webserver (So 200 concurrent web requests should be handeled). We increased that to this high number, because of those slow requests that would otherwise block further requests.
The server is usually hit with up to 20 web requests at the same time
Each of those 20 web requests, can spawn up to 16 further threads to parallely pull data from the MySQL Webserver.
So in an extreme case we see up to 100 to 200 concurrent requets on the database. This is why we created the dummy case to simulate the behavior to understand where the bottleneck is.
"Select *" was also simplified for the dummy example, usually we just pull "date" and "value" and we use different id_security values

Is there a better way to test, if the Futures/Threads Overhead in Python is the problem or the concurrent requests on the MySQL database itself? Is it right to assume, that MySQL should easily be able to handle 100 concurrent queries that are completly in the innodb_buffer_pool, without slowing down by a factor of 5 to 10?

Update 3:

As requested:

SELECT COUNT(*) FROM information_schema.tables;: https://pastebin.com/SRmKupdC -- 392 rows

SHOW GLOBAL STATUS; https://pastebin.com/6F8kjCX6

SHOW GLOBAL VARIABLES; https://pastebin.com/r5tmFX7d

SHOW FULL PROCESSLIST; (In Idle) https://pastebin.com/Q3DhEZxV -- Nothing exciting

SHOW FULL PROCESSLIST; (With some load on webserver) https://pastebin.com/4FK1FjEC -- One query, running for only 0 seconds:

SELECT  date as d,
    value  as v
    FROM  datalake.series_security
    WHERE  id_security = 1008
      AND  date >= '1962-02-09 00:00:00'
      AND  date <= '2099-01-01 00:00:00'
      AND  id_field=100
    ORDER BY  date ASC |

STATUS https://pastebin.com/X5SL64XK -- mostly latin1

SHOW ENGINE INNODB STATUS; (Idle) https://pastebin.com/JPiqF76s --yawn.

SHOW ENGINE INNODB STATUS; (With some load on webserver) https://pastebin.com/GGsmzsXz -- essentially idle

Solution

We found, that it has nothing to do with MySQL but with the Global Interpreter Lock (GIL) of Python.

Due to the fact that the function read_db() /pd.read_sql(query, con) is CPU-bound and Python has the GIL, the query results are received and processed sequentially.

One solution is to use multiprocessing instead of multithreading. One can easily exchange the ThreadPoolExecutor with the ProcessPoolExecutor from concurrent.futures.