I am running into performance issues writing Pandas DataFrames to a SQL database. To be as fast as possible I use MemSQL (it speaks the MySQL wire protocol, so my code doesn't have to change at all). I benchmarked my instance just now:
docker run --rm -it --link=memsql:memsql memsql/quickstart simple-benchmark
Creating database simple_benchmark
Warming up workload
Launching 10 workers
Workload will take approximately 30 seconds.
Stopping workload
42985000 rows inserted using 10 threads
1432833.3 rows per second
That isn't spectacular, but it's also just my local laptop. I know, I know: I am also using the root user, but it's a throw-away Docker container.
Here is the code which writes my DataFrame to the DB:
import MySQLdb
import mysql.connector
from sqlalchemy import create_engine
from pandas.util.testing import test_parallel

engine = create_engine('mysql+mysqlconnector://root@localhost:3306/netflow_test', echo=False)

# max_allowed_packet = 1000M in mysql.conf
# no effect

# @test_parallel(num_threads=8)
def commit_flows(netflow_df2):
    %time netflow_df2.to_sql(name='netflow_ids', con=engine, if_exists='append', index=False, chunksize=500)

commit_flows(netflow_df2)
Below is the %time measurement of the function. Multi-threading does not make it faster; throughput stays within 7,000-8,000 rows/s.
CPU times: user 2min 6s, sys: 1.69 s, total: 2min 8s
Wall time: 2min 18s
I also increased the max_allowed_packet size to commit in bulk, together with a larger chunk size. Still not faster.
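Roughly what that attempt looked like (a sketch, not a verified recipe: whether MemSQL honours MySQL's SET GLOBAL max_allowed_packet is an assumption on my part, and 10000 is just an example chunk size):

# Assumption: the server accepts MySQL's max_allowed_packet variable.
# The value is in bytes; 1048576000 bytes is the 1000M from mysql.conf.
with engine.connect() as conn:
    conn.execute("SET GLOBAL max_allowed_packet = 1048576000")

# Retry with a larger chunk size so each batch carries more rows.
netflow_df2.to_sql(name='netflow_ids', con=engine, if_exists='append',
                   index=False, chunksize=10000)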
Here is the shape of the DataFrame:
netflow_df2.shape
(1015391, 20)
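A quick sanity check on those numbers:

# 1,015,391 rows in 2 min 18 s (138 s) of wall time:
print(1015391 / 138.0)   # ~7360 rows/s, consistent with the 7000-8000 figure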
Does anyone know how I can make this faster?
In case someone runs into a similar situation:
I removed SQLAlchemy and used the (deprecated) MySQL flavor of Pandas' to_sql() function with a raw MySQLdb connection. The result is more than three times faster (126 s down to 38.2 s, see below). I don't recommend using this, but it works for me at the moment.
import MySQLdb
import mysql.connector
from sqlalchemy import create_engine
from pandas.util.testing import test_parallel

# raw DBAPI connection instead of an SQLAlchemy engine
engine = MySQLdb.connect("127.0.0.1", "root", "", "netflow_test")
# engine = create_engine('mysql+mysqlconnector://root@localhost:3306/netflow_test', echo=False)

# @test_parallel(num_threads=8)
def commit_flows(netflow_df2):
    %time netflow_df2.to_sql(name='netflow_ids', flavor='mysql', con=engine, if_exists='append', index=False, chunksize=50000)

commit_flows(netflow_df2)
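As far as I can tell (I have not dug through the pandas internals to confirm this), the speedup comes from pandas falling back to its legacy insertion path when given a raw DBAPI connection together with flavor='mysql'; MySQLdb can then batch each chunk into multi-row INSERT statements, which means far fewer round trips to the server.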
If I find out how to convince MemSQL to accept a large query (similar to MySQL's max_allowed_packet = 1000M in mysql.conf), I will be even faster. I should be able to hit more than 50,000 rows per second here.
CPU times: user 28.7 s, sys: 797 ms, total: 29.5 s
Wall time: 38.2 s
126 s before, 38.2 s now.
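In rows per second, that works out as:

rows = 1015391
print(rows / 38.2)      # ~26,600 rows/s now, vs ~7,360 rows/s before
print(rows / 50000.0)   # at the hoped-for 50,000 rows/s, the full load would take ~20 s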