I have an INSERT query that gets its values from a SELECT statement. But since the SELECT returns millions of records, it put too much load on the MySQL server, so we decided to break the SELECT query into parts and execute it in chunks using a LIMIT clause:
INSERT INTO target_table
SELECT * FROM source_table
WHERE my_condition = value
...
LIMIT <start>, <end>
We will keep increasing the start and end values until the SELECT returns 0 rows. I'm also thinking of making this multi-threaded.

How can I do this with PyMySQL? Do I need to execute the SELECT, fetch the results and then generate the INSERT?
First of all, to answer your question: in PyMySQL, you get the number of affected rows as the return value of cursor.execute:

execute(query, args=None)
    Execute a query.
    Parameters:
        query (str) – Query to execute.
        args (tuple, list or dict) – parameters used with query. (optional)
    Returns: Number of affected rows
So you could just execute your query repeatedly until you get back a value less than your selected range.
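For illustration, here is a minimal sketch of that loop with PyMySQL, keeping your offset-based approach for now (the connection parameters, the value variable, the chunk size and the id column are assumptions; the table and column names are taken from your question):

import pymysql

# All connection parameters and the filter value are placeholders; adjust them.
conn = pymysql.connect(host="localhost", user="user", password="secret", db="mydb")
value = "something"       # the value used in your where-condition
chunk_size = 10000
offset = 0

with conn.cursor() as cursor:
    while True:
        affected = cursor.execute(
            "INSERT INTO target_table "
            "SELECT * FROM source_table "
            "WHERE my_condition = %s "
            "ORDER BY id "            # fixed order, see the notes below
            "LIMIT %s, %s",
            (value, offset, chunk_size),
        )
        conn.commit()
        if affected < chunk_size:     # last (possibly empty) chunk reached
            break
        offset += chunk_size

conn.close()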
Anyway, please consider:

- Check whether you can speed up the select itself (assuming it's not as simple as in your example), e.g. by adding indexes. You may also want to test the difference between just selecting and actually inserting, to get a rough idea which part is more relevant.
- Check whether you can split the transaction at all (if all rows have to be inserted atomically, you cannot). And splitting only helps if it actually reduces the work for the server (with offset, this is unlikely, see below); otherwise, you would usually need to add small pauses between (small enough) parts that you run consecutively, to spread the same workload over a longer time.
- limit only makes sense if you have an order by (probably by the primary key); otherwise, in successive runs, the m-th row can be a different row than before (because the order is not fixed). This may or may not increase the load (and resource requirements), depending on your indexes and your where-condition.
- If other processes modify the table while your chunks are running (e.g. change the my_condition of the first row), all successive offsets will shift, and you may skip a row or get a row twice. You will probably need to lock the rows, which might prevent running your queries in parallel (as they lock the same rows), and also might influence the decision whether you can split the transaction (see 2nd bullet point).
- offset requires MySQL to first find and then skip rows. So if you split the query into n parts, the first row will need to be processed n times (and the last row usually once), so the total work (for selecting) will be increased by (n^2-n)/2. So especially if selecting the rows is the most relevant part (see 1st bullet point), this can actually make your situation much worse: just the last run will need to find the same number of rows as your current query (although it throws most of them away), and it might even need more resources for it, depending on the effect of order by.

You may be able to get around some of the offset-problems by using the primary key in the condition, e.g. with a loop that contains something like this:
select max(id) as new_max from (
    select id from source_table
    where id > last_id and <your condition>
    order by id
    limit 1000 -- chunk size; no offset!
) as chunk
Exit the loop if new_max is null; otherwise do the insert:
insert ... select ...
where id > last_id and id <= new_max and <your condition>
Then set last_id = new_max and continue the loop.
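A rough PyMySQL version of this loop (a sketch only; it assumes an integer primary key id and reuses the conn and value placeholders from the snippet above):

last_id = 0
chunk_size = 1000

with conn.cursor() as cursor:
    while True:
        # Find the id boundary of the next chunk; no offset needed.
        cursor.execute(
            "SELECT MAX(id) AS new_max FROM ("
            "  SELECT id FROM source_table"
            "  WHERE id > %s AND my_condition = %s"
            "  ORDER BY id LIMIT %s"
            ") AS chunk",
            (last_id, value, chunk_size),
        )
        new_max = cursor.fetchone()[0]
        if new_max is None:           # no matching rows left, exit the loop
            break

        # Copy exactly that id range into the target table.
        cursor.execute(
            "INSERT INTO target_table "
            "SELECT * FROM source_table "
            "WHERE id > %s AND id <= %s AND my_condition = %s",
            (last_id, new_max, value),
        )
        conn.commit()
        last_id = new_max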
It doubles the number of queries, because in contrast to limit with an offset, you need to know the actual id boundaries. It still requires your primary key and your where-condition to be compatible (so you may need to add an index that fits). If your search condition matches a significant percentage (more than about 15% or 20%) of your source table, using the primary key might be the best execution plan anyway.
If you want to parallelize this (depending on your transaction requirements and if it is potentially worthwhile, see above), you could first get the maximum value of the primary key (select max(id) as max_id from ...) and give each thread a range to work with. E.g. for max_id=3000 and 3 threads, start them with one of (0..1000), (1001..2000), (2001..3000) and include that range in the first query:
select max(id) as new_max from (
    select id from source_table
    where id > last_id
      and id >= $threadmin_id and id <= $threadmax_id
      and <your condition>
    order by id
    limit 1000
) as chunk
Whether those ranges end up roughly equal in size depends on your data distribution (and you may find better ranges for your situation; calculating the exact ranges would require executing the query, so you probably can't be exact).
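If you go down that route, a sketch of the threaded variant could look like this (again with assumed connection parameters and placeholder names; note that each thread opens its own PyMySQL connection, since a connection object must not be shared between threads):

import threading
import pymysql

CHUNK_SIZE = 1000
VALUE = "something"   # the value used in your where-condition

def copy_range(threadmin_id, threadmax_id):
    # One connection per thread; PyMySQL connections are not thread-safe.
    conn = pymysql.connect(host="localhost", user="user",
                           password="secret", db="mydb")
    last_id = 0
    with conn.cursor() as cursor:
        while True:
            cursor.execute(
                "SELECT MAX(id) AS new_max FROM ("
                "  SELECT id FROM source_table"
                "  WHERE id > %s AND id >= %s AND id <= %s"
                "    AND my_condition = %s"
                "  ORDER BY id LIMIT %s"
                ") AS chunk",
                (last_id, threadmin_id, threadmax_id, VALUE, CHUNK_SIZE),
            )
            new_max = cursor.fetchone()[0]
            if new_max is None:
                break
            cursor.execute(
                "INSERT INTO target_table "
                "SELECT * FROM source_table "
                "WHERE id > %s AND id <= %s AND my_condition = %s",
                (last_id, new_max, VALUE),
            )
            conn.commit()
            last_id = new_max
    conn.close()

# max_id = 3000 split over 3 threads, as in the ranges above.
threads = [threading.Thread(target=copy_range, args=(lo, hi))
           for (lo, hi) in [(0, 1000), (1001, 2000), (2001, 3000)]]
for t in threads:
    t.start()
for t in threads:
    t.join()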