Tags: python, mysql, pymysql

Check if an INSERT with a SELECT was successful in PyMySQL


I have an INSERT query that gets its values from a SELECT statement. But since the SELECT returns millions of records, it puts too much load on the MySQL server. So we decided to break the SELECT query into parts and execute it with a LIMIT clause.

INSERT INTO target_table 
    SELECT * FROM source_table
    WHERE my_condition = value
    ...
    LIMIT <start>, <end>

We will keep increasing the start and end values until the SELECT returns 0 rows. I'm also thinking of making this multi-threaded.

How can I do it with PyMySQL?

Do I need to execute the SELECT, get the results and then generate the INSERT?


Solution

  • First of all, to answer your question: in PyMySQL, the number of affected rows is the return value of cursor.execute:

    execute(query, args=None)
    
    Execute a query
    
    Parameters:   
        query (str) – Query to execute.
        args (tuple, list or dict) – parameters used with query. (optional)
    
    Returns: Number of affected rows
    

    So you could just execute your query repeatedly until you get a value less than your selected range size as a result.
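
    For illustration, here is a minimal sketch of that loop with PyMySQL. The connection parameters, the chunk size, and the condition value are assumptions; the query mirrors the one from the question, and the caveats in the list below still apply:

    import pymysql

    CHUNK_SIZE = 10000            # rows per chunk, an assumed value
    condition_value = "value"     # stands in for the question's my_condition = value

    # Assumed connection parameters; adjust to your environment.
    connection = pymysql.connect(host="localhost", user="user",
                                 password="secret", database="mydb")
    try:
        with connection.cursor() as cursor:
            offset = 0
            while True:
                # execute() returns the number of affected rows, i.e. how
                # many rows this INSERT ... SELECT actually inserted.
                affected = cursor.execute(
                    """
                    INSERT INTO target_table
                    SELECT * FROM source_table
                    WHERE my_condition = %s
                    LIMIT %s, %s
                    """,
                    (condition_value, offset, CHUNK_SIZE),
                )
                connection.commit()            # one transaction per chunk
                if affected < CHUNK_SIZE:      # last (possibly empty) chunk
                    break
                offset += CHUNK_SIZE
    finally:
        connection.close()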

    Anyway, please consider:

    • the first thing to check is whether you can optimize your select (assuming it's not as simple as in your example), e.g. by adding indexes. You may also want to test the difference between just selecting and actually inserting, to get a rough idea which part is more relevant.
    • if the insertion is causing the problem, it can be due to the size of the transaction. In that case, splitting it up will only reduce the problem if you can also split up the transaction (although since you are considering executing queries in parallel, this doesn't seem to be a concern).
    • if a query generates too much (cpu) load, running multiple instances of that query in parallel can, at best, only spread it over multiple cores, which will actually reduce the available cpu time for other queries. If "load" refers to I/O load, effects of limited resources or "general responsiveness", it is possible though, e.g. a small query might generate a small temporary table in memory, while a big query generates a big temporary table on disk (although specifically with offset, this is unlikely, see below). Otherwise, you would usually need to add small pauses between (small enough) parts that you run consecutively, to spread the same workload over a longer time.
    • limit only makes sense if you have an order by (probably by the primary key), otherwise, in successive runs, the m-th row can be a different row than before (because the order is not fixed). This may or may not increase the load (and resource requirements) depending on your indexes and your where-condition.
    • the same is true for updates to your source table, as if you add or remove a row from the resultset (e.g. changing the value of my_condition of the first row), all successive offsets will shift, and you may skip a row or get a row twice. You will probably need to lock the rows, which might prevent running your queries in parallel (as they lock the same rows), and also might influence the decision if you can split the transaction (see 2nd bullet point).
    • using an offset requires MySQL to first find and then skip rows. So if you split the query into n parts, the first row will need to be processed n times (and the last row usually once), so the total work for selecting will be increased by about (n^2-n)/2 chunk scans (see the small worked example after this list). So especially if selecting the rows is the most relevant part (see 1st bullet point), this can actually make your situation much worse: just the last run will need to find the same number of rows as your current query (although it throws most of them away), and might even need more resources for it depending on the effect of order by.
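
    As a quick illustration of that offset overhead (the numbers are only an example): if 1,000,000 matching rows are split into n = 10 chunks of 100,000 rows each, chunk k has to scan roughly k * 100,000 rows (the offset it skips plus the chunk itself). That adds up to 100,000 * (1 + 2 + ... + 10) = 5,500,000 scanned rows, i.e. 45 extra chunk scans compared to the 10 a single pass would need, which matches (n^2-n)/2 = 45.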

    You may be able to get around some of the offset-problems by using the primary key in the condition, e.g. have a loop that contains something like this:

    select max(id) as new_max from (
        select id from source_table
        where id > last_id and <your condition>
        order by id limit 1000  -- no offset!
    ) as next_chunk
    

    Exit the loop if new_max is null, otherwise do the insert:

    insert ... select ... 
    where id > last_id and id <= new_max and <your condition>
    

    Then set last_id = new_max and continue the loop.
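
    Put together in PyMySQL, that loop could look roughly like this (a sketch only: the chunk size, the connection parameters and the condition value are assumptions; the table and column names follow the question):

    import pymysql

    CHUNK_SIZE = 1000             # matches the "limit 1000" above
    condition_value = "value"     # stands in for the question's my_condition = value

    # Assumed connection parameters; adjust to your environment.
    connection = pymysql.connect(host="localhost", user="user",
                                 password="secret", database="mydb")
    try:
        with connection.cursor() as cursor:
            last_id = 0           # assumes ids are positive
            while True:
                # Highest id of the next chunk -- no offset involved.
                cursor.execute(
                    """
                    SELECT MAX(id) AS new_max FROM (
                        SELECT id FROM source_table
                        WHERE id > %s AND my_condition = %s
                        ORDER BY id LIMIT %s
                    ) AS next_chunk
                    """,
                    (last_id, condition_value, CHUNK_SIZE),
                )
                new_max = cursor.fetchone()[0]
                if new_max is None:            # nothing left to copy
                    break
                # Insert exactly that id range.
                cursor.execute(
                    """
                    INSERT INTO target_table
                    SELECT * FROM source_table
                    WHERE id > %s AND id <= %s AND my_condition = %s
                    """,
                    (last_id, new_max, condition_value),
                )
                connection.commit()            # one transaction per chunk
                last_id = new_max              # continue after this chunk
    finally:
        connection.close()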

    It doubles the number of queries, as, in contrast to limit with an offset, you need to know the actual id. It still requires your primary key and your where-condition to be compatible (so you may need to add an index that fits). If your search condition matches a significant percentage (more than about 15% or 20%) of your source table, using the primary key might be the best execution plan anyway.

    If you want to parallelize this (depending on your transaction requirements and on whether it is potentially worthwhile, see above), you could first get the maximum value of the primary key (select max(id) as max_id from ...), and give each thread a range to work with. E.g. for max_id=3000 and 3 threads, start them with one of (0..1000), (1001..2000), (2001..3000) and include that range in the first query:

    select max(id) as new_max from (
        select id from source_table
        where id > last_id
          and id >= $threadmin_id and id <= $threadmax_id
          and <your condition>
        order by id limit 1000
    ) as next_chunk
    

    Whether those ranges are of roughly equal size depends on your data distribution (and you may find better ranges for your situation; calculating the exact ranges would require executing the query though, so you probably can't be exact).
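
    For completeness, a rough sketch of how that parallel variant could be wired up with Python's threading module. The helper name copy_range, the thread count, and the connection parameters are illustrative assumptions, not part of the original answer; each thread gets its own connection, since a PyMySQL connection should not be shared between threads:

    import threading
    import pymysql

    NUM_THREADS = 3               # assumed thread count
    CHUNK_SIZE = 1000
    condition_value = "value"     # stands in for the question's my_condition = value

    def copy_range(range_min, range_max):
        # Hypothetical helper: runs the keyset loop from the previous sketch,
        # restricted to range_min < id <= range_max, on its own connection.
        conn = pymysql.connect(host="localhost", user="user",
                               password="secret", database="mydb")
        try:
            with conn.cursor() as cursor:
                last_id = range_min
                while True:
                    cursor.execute(
                        """
                        SELECT MAX(id) AS new_max FROM (
                            SELECT id FROM source_table
                            WHERE id > %s AND id <= %s AND my_condition = %s
                            ORDER BY id LIMIT %s
                        ) AS next_chunk
                        """,
                        (last_id, range_max, condition_value, CHUNK_SIZE),
                    )
                    new_max = cursor.fetchone()[0]
                    if new_max is None:
                        break
                    cursor.execute(
                        """
                        INSERT INTO target_table
                        SELECT * FROM source_table
                        WHERE id > %s AND id <= %s AND my_condition = %s
                        """,
                        (last_id, new_max, condition_value),
                    )
                    conn.commit()
                    last_id = new_max
        finally:
            conn.close()

    # Determine the id range once, then split it into contiguous sub-ranges,
    # e.g. max_id=3000 and 3 threads give (0..1000), (1000..2000), (2000..3000).
    setup = pymysql.connect(host="localhost", user="user",
                            password="secret", database="mydb")
    with setup.cursor() as cursor:
        cursor.execute("SELECT MAX(id) FROM source_table")
        max_id = cursor.fetchone()[0] or 0
    setup.close()

    step = max(1, -(-max_id // NUM_THREADS))   # ceiling division
    threads = [
        threading.Thread(target=copy_range, args=(low, min(low + step, max_id)))
        for low in range(0, max_id, step)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()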