I have a MySQL server running on a remote host. The connection to the host is fairly slow, and that affects the performance of the Python code I am using. I find that using the executemany() function gives a big improvement over using a loop to insert many rows. My challenge is that for each row I insert into one table, I need to insert several rows into another table. My sample below does not contain much data, but my production data could be thousands of rows.
I know that this subject has been asked about many times in many places, but I don't see any kind of definitive answer, so I'm asking here...
My tables...
Table: makes
    pkey bigint autoincrement primary_key
    make varchar(255) not_null

Table: models
    pkey bigint autoincrement primary_key
    make_key bigint not_null
    model varchar(255) not_null
...and the code...
...
cars = {"Ford": ["F150", "Fusion", "Taurus"],
"Chevrolet": ["Malibu", "Camaro", "Vega"],
"Chrysler": ["300", "200"],
"Toyota": ["Prius", "Corolla"]}
# Fill makes table with car makes
sql_data = list(cars.keys())
sql = "INSERT INTO makes (make) VALUES (%s)"
cursor.executemany(sql, sql_data)
rows_added = len(sqldata)
# Find the primary key for the first row that was just added
sql = "SELECT LAST_INSERT_ID()"
cursor.execute(sql)
rows = cursor.fetchall()
first_key = rows[0][0]
# Fill the models table with the car models, linked to their make
this_key = first_key
sql_data = []
for car in cars:
for model in cars[car]:
sql_data.append((this_key, car))
this_key += 1
sql = "INSERT INTO models (make_key, model) VALUES (%s, %s)"
cursor.executemany(sql, sql_data)
cursor.execute("COMMIT")
...
I have, more than once, measured about 10x speedup when batching inserts.
If you are inserting 1 row in table A, then 100 rows in table B, don't worry about the speed of the 1 row; worry about the speed of the 100.
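As a minimal sketch of what the batching buys (assuming the mysql-connector-python driver; the host, credentials, table, and row count below are placeholders), compare one round trip per row with a single executemany() call:

import time
import mysql.connector   # assumption: mysql-connector-python

# Placeholder connection details
cnx = mysql.connector.connect(host="remote-host", user="user",
                              password="secret", database="cars")
cursor = cnx.cursor()

rows = [("Make %d" % i,) for i in range(1000)]   # dummy data
sql = "INSERT INTO makes (make) VALUES (%s)"

t0 = time.monotonic()
for row in rows:                  # one network round trip per row
    cursor.execute(sql, row)
t1 = time.monotonic()

cursor.executemany(sql, rows)     # the driver folds this into one multi-row INSERT
t2 = time.monotonic()

cnx.commit()
print("row-by-row %.2fs vs. executemany %.2fs" % (t1 - t0, t2 - t1))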
Yes, it is clumsy to get the ids generated by an insert. I have found no straightforward way; LAST_INSERT_ID() exists, but it works only for a single-row insert.
So, I have developed the following to do a batch of "normalization" inserts. This is where you have a table that maps strings to ids (and the same string is likely to show up repeatedly). It takes two steps: first a batch insert of the "new" strings, then fetch all the needed ids and copy them into the other table. The details are laid out here: http://mysql.rjweb.org/doc.php/staging_table#normalization (Sorry, I am not fluent in Python or the hundred other ways to talk to MySQL, so I can't give you Python code.)
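A rough Python sketch of that two-step idea (not the staging-table code from the link) might look like this. It assumes the questioner's cars dict and an open cursor as in the question, plus a UNIQUE index on makes.make so INSERT IGNORE can skip strings that already exist; note that this simple form can still "burn" auto-increment ids on duplicates, which the linked approach avoids.

# Step 1: batch-insert any "new" makes; rows that already exist are skipped
# (requires a UNIQUE index on makes.make).
cursor.executemany("INSERT IGNORE INTO makes (make) VALUES (%s)",
                   [(make,) for make in cars])

# Step 2: fetch the ids for all needed makes in one query.
placeholders = ", ".join(["%s"] * len(cars))
cursor.execute("SELECT make, pkey FROM makes WHERE make IN (%s)" % placeholders,
               list(cars))
make_ids = dict(cursor.fetchall())

# Build the models rows from the fetched ids instead of assuming
# consecutive values from LAST_INSERT_ID().
model_rows = [(make_ids[make], model)
              for make, models in cars.items()
              for model in models]
cursor.executemany("INSERT INTO models (make_key, model) VALUES (%s, %s)",
                   model_rows)
cursor.execute("COMMIT")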
Your use case is an example of "normalization"; I recommend doing it outside the main transaction. Note that my code takes care of multiple connections, avoids 'burning' ids, etc.
When you have subcategories ("make" + "model" or "city" + "state" + "country"), I recommend a single normalization table, not one for each.
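One way to read that (the table and column names here are hypothetical, not taken from the linked article) is a single lookup table keyed on the make+model pair:

# Hypothetical combined lookup table: one id per (make, model) pair,
# instead of separate makes and models normalization tables.
cursor.execute("""
    CREATE TABLE car_names (
        pkey  SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        make  VARCHAR(255) NOT NULL,
        model VARCHAR(255) NOT NULL,
        UNIQUE KEY (make, model)
    )
""")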
In your example, pkey could be a 2-byte SMALLINT UNSIGNED (limit 64K) instead of a bulky 8-byte BIGINT.
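For instance, a sketch of the makes table with the smaller key, assuming the number of distinct makes stays well under 65,535 (models.make_key would need the same type):

cursor.execute("""
    CREATE TABLE makes (
        pkey SMALLINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        make VARCHAR(255) NOT NULL,
        UNIQUE KEY (make)   -- also needed for the INSERT IGNORE step above
    )
""")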