If I have a batch insert statement like:
INSERT INTO TABLE VALUES (x,y,z),(x2,y2,z2),(x3,y3,z3);
and x2 violates a primary key, is the error thrown before or after the processing of x3?
Specifically, I have a bunch of batch inserts in a try-catch block using Python and PyMySQL, like:
conn = myDB.cursor()
try:
    conn.execute("INSERT INTO TABLE VALUES (x,y,z),(x2,y2,z2),(x3,y3,z3);")
except pymysql.Error as msg:
    print("MYSQL ERROR!:{0}".format(msg))  # print error
I want to make sure that if one of the tuples in the batch insert fails, thus printing the error, the rest of the tuples in that same batch were still processed.
My motivation is that I am transferring LOTS of data across two servers. On server 1 the data is stored in log files, and it is being inserted into MySQL on server 2. Some of the data is already in MySQL on server 2, so there are many failures. However, if I do not use batch inserts and instead issue a separate INSERT INTO for every one of the (millions of) records, things seem to run much more slowly. So I'm in trouble either way: with batch inserts, duplicate failures blow up the whole statement, and without batch inserts the process takes much, much longer.
The way that MySQL deals with multiple insert (or update) statements varies by table engine and server SQL mode.
While only the table engine is really important to the key constraints you're asking about here, it's important to understand the bigger picture, so I'm going to take the time to add some extra details. If you're in a hurry, feel free to just read the first and last sections below.
In the case of a nontransactional table engine like MyISAM, you can easily end up performing a partial update because each insert or update is performed sequentially and cannot be rolled back when the bad row is encountered and the statement is aborted.
However, if you use a transactional table engine like InnoDB, any constraint violation during an insert or update statement will trigger a rollback of any changes made up to that point, in addition to aborting the statement.
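To see that difference concretely, here's a minimal PyMySQL sketch (the connection details and table name are made up for illustration). The same failing multi-row insert leaves the table empty on InnoDB but leaves one row behind on MyISAM:

import pymysql

conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="test")
cur = conn.cursor()
cur.execute("CREATE TABLE t (id INT PRIMARY KEY) ENGINE=InnoDB")
try:
    # The second tuple violates the primary key.
    cur.execute("INSERT INTO t (id) VALUES (1), (1), (2)")
except pymysql.Error as msg:
    print("MYSQL ERROR!:{0}".format(msg))
cur.execute("SELECT COUNT(*) FROM t")
print(cur.fetchone()[0])  # 0 on InnoDB (rolled back); 1 on MyISAM (partial insert sticks)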
The server SQL mode becomes important when you're not violating a key constraint, but the data you're trying to insert or update doesn't fit the definition of the column you're putting it into. For example:

- inserting NULL into a NOT NULL column
- inserting the string '123' into a column defined with a numeric type (rather than the number 123)
- asking a CHAR(3) column to hold the value 'four'
In these cases, MySQL will throw an error if strict mode is in effect. However, if strict mode is not in effect, it will often "fix" your mistake instead, which can cause all manner of potentially harmful behavior (see MySQL 'Truncated incorrect INTEGER value' and mysql string conversion return 0 for just two examples).
There are some potential "gotchas" with nontransactional tables and strict mode. You haven't told us which table engine you're working with, but this answer as currently written is clearly using a nontransactional table and it's important to know how that affects the result.
For example, consider the following set of statements:
SET sql_mode = ''; # This will make sure strict mode is not in effect
CREATE TABLE tbl (
id INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
val INT
) ENGINE=MyISAM; # A nontransactional table engine (this used to be the default)
INSERT INTO tbl (val) VALUES (1), ('two'), (3);
INSERT INTO tbl (val) VALUES ('four'), (5), (6);
INSERT INTO tbl (val) VALUES ('7'), (8), (9);
Since strict mode is not in effect, it shouldn't be surprising that all nine values are inserted and the invalid strings are coerced to integers. The server is clever enough to recognize '7' as a number, but doesn't recognize 'two' or 'four', so they get converted to the default value for numeric types in MySQL, 0:
mysql> SELECT val FROM tbl;
+------+
| val  |
+------+
|    1 |
|    0 |
|    3 |
|    0 |
|    5 |
|    6 |
|    7 |
|    8 |
|    9 |
+------+
9 rows in set (0.00 sec)
Now, try doing that again with sql_mode = 'STRICT_ALL_TABLES'. To make a long story short, the first INSERT statement will result in a partial insert, the second will fail entirely, and the third will silently coerce '7' to 7 (which doesn't seem very "strict" if you ask me, but it's documented behavior and not that unreasonable).
But wait, there's more! Try it with sql_mode = 'STRICT_TRANS_TABLES'. Now you'll find that the first statement throws a warning instead of an error - but the second statement still fails! This can be particularly frustrating if you're using LOAD DATA with a bunch of files and some are failing while others aren't (see this closed bug report).
In the case of key violations specifically, what matters is only whether the table engine is transactional (example: InnoDB) or not (example: MyISAM). If you're working on a transactional table, the Python code in your question will cause the MySQL server to do things in this order:

1. Parse the INSERT statement and start a transaction.*
2. Insert the first tuple.
3. Hit the key violation on the second tuple and abort the statement (the third tuple is never processed).
4. Roll back the transaction.
5. Return an error to pymysql.

*It would make sense for the statement to be parsed before starting a transaction, but I don't know the exact implementation so I'll put these together as one step.

In this case, any changes prior to the bad tuple would already have been reversed by the time your script receives an error message from the server and enters the except block.
If you're working on a nontransactional table, however, the server will skip step 4 (and the relevant part of step 1) because the table engine doesn't support transaction statements. In this case, at the time your script enters the except block, the first tuple has been inserted, the second has blown up, and you may not be able to easily determine how many rows were successfully inserted, because the function that normally does that returns -1 if the last insert or update statement threw an error.
Partial updates should be strictly avoided; they're much harder to fix than simply making sure your statement succeeds entirely or fails entirely. In this type of situation, the documentation suggests:
To avoid [a partial update], use single-row statements, which can be aborted without changing the table.
And in my opinion, that's exactly what you should do. It's hardly difficult to write a loop in Python and you won't have to repeat code as long as you're inserting values properly as parameters rather than hard-coding them - which you're already doing, right? RIGHT??? >:(
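A minimal sketch of that loop, assuming rows holds the (x, y, z) tuples parsed from your log files and tbl stands in for your table (connection details made up as before):

import pymysql

conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="test")
cur = conn.cursor()
rows = [(1, 2, 3), (4, 5, 6)]  # stand-in for the tuples from your logs

for row in rows:
    try:
        # One row per statement, so a duplicate key aborts only this row.
        cur.execute("INSERT INTO tbl VALUES (%s, %s, %s)", row)
    except pymysql.IntegrityError as msg:
        print("MYSQL ERROR!:{0}".format(msg))

conn.commit()

Catching pymysql.IntegrityError rather than the broader pymysql.Error lets duplicate keys pass quietly while anything unexpected still raises. And since PyMySQL doesn't autocommit by default, the whole loop runs as a single transaction with one commit at the end, which on a transactional engine should recover much of the speed you'd lose by committing row by row.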
If you expect to violate your constraint sometimes and you want to take some other action when the row you try to insert turns out already to exist, then you might be interested in `INSERT ... ON DUPLICATE KEY UPDATE`. This lets you perform such amazing feats of computational gymnastics as counting stuff:
mysql> create table counting_is_fun (
    -> stuff int primary key,
    -> count int unsigned not null default 1
    -> );
Query OK, 0 rows affected (0.12 sec)

mysql> insert into counting_is_fun (stuff)
    -> values (1), (2), (5), (3), (3)
    -> on duplicate key update count = count + 1;
Query OK, 6 rows affected (0.04 sec)
Records: 5  Duplicates: 1  Warnings: 0
mysql> select * from counting_is_fun;
+-------+-------+
| stuff | count |
+-------+-------+
|     1 |     1 |
|     2 |     1 |
|     3 |     2 |
|     5 |     1 |
+-------+-------+
4 rows in set (0.00 sec)
(Note: Compare the number of tuples you inserted to the number of "rows affected" by the query and the number of rows in the table afterward. Isn't counting fun?)
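If you go this route from Python, executemany can batch your parameterized rows into one multi-row statement for you. A sketch, with the same made-up connection details as above (as far as I know, PyMySQL handles an ON DUPLICATE KEY UPDATE clause here as long as the clause itself contains no placeholders):

import pymysql

conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="test")
cur = conn.cursor()
rows = [(1,), (2,), (5,), (3,), (3,)]

# executemany rewrites this into a single multi-row INSERT behind the scenes.
cur.executemany(
    "INSERT INTO counting_is_fun (stuff) VALUES (%s)"
    " ON DUPLICATE KEY UPDATE count = count + 1",
    rows)
conn.commit()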
Or, if you think the data you're inserting right now is at least as good as the data currently in the table, you could look into REPLACE INTO - but this is a MySQL-specific extension to the SQL standard, and as usual it has its quirks, particularly with respect to AUTO_INCREMENT fields and ON DELETE actions associated with foreign key references.
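Here's a quick sketch of the quirk that bites most people, reusing the counting_is_fun table and the made-up connection from above. REPLACE is really a DELETE plus an INSERT, which is exactly why those ON DELETE actions can fire:

import pymysql

conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="test")
cur = conn.cursor()

# execute() returns the affected-row count: REPLACE reports 2 when it
# deleted a conflicting row and inserted a new one, 1 for a plain insert.
n = cur.execute("REPLACE INTO counting_is_fun (stuff, count) VALUES (3, 99)")
print(n)  # 2 if stuff=3 already existed, 1 otherwise
conn.commit()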
One other approach people love to suggest is INSERT IGNORE. This ignores the error and just keeps on rolling. Great, right? Who needs errors, anyway? The reasons I don't like this as a solution are:

- INSERT IGNORE will cause any error that occurs during the statement to be ignored, not just whatever error you think you don't care about.
- INSERT IGNORE says, "I don't know how to do this the right way, so I'm just going to do it the wrong way."

I do use INSERT IGNORE sometimes, but when the documentation flat-out tells you the "right way" to do something, don't outsmart yourself. Try it that way first; if you still have a really good reason to do it the wrong way and risk violating the integrity of your data and ruining everything forever, at least you've made an informed decision.
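To see the first of those objections in action, here's one last sketch (same made-up connection details as before). Under strict mode the bad string would normally be an error on its own; with IGNORE, both it and the duplicate key are downgraded to warnings and the statement "succeeds":

import pymysql

conn = pymysql.connect(host="localhost", user="user",
                       password="secret", database="test")
cur = conn.cursor()

# The duplicate (3) is skipped and 'oops' is coerced to 0 - two different
# problems, both silently reduced to warnings by IGNORE.
cur.execute("INSERT IGNORE INTO counting_is_fun (stuff) VALUES (3), ('oops')")
cur.execute("SHOW WARNINGS")
for warning in cur.fetchall():
    print(warning)
conn.commit()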