Tags: mysql, database, transactions, database-performance

How do MySQL Transactions work under the hood? Where does MySQL store the temporary value of a field?


I understand that a MySQL transaction will let you perform multiple inserts/updates at once where either all or none of them will succeed and I understand how to use them.

What I am not sure about is how MySQL manages to hold all the data over long periods of time and what effect this might have on performance.

Selecting before another transaction has committed

I have a table of 100 people named "John" and I update every name to "Jane" in a loop inside a transaction; each update takes 1 second, meaning it takes 100 seconds to finish the transaction.

If I make a select from another process during that 100 seconds the result will be "John" rather than "Jane". If I make a select after the transaction is committed it will return "Jane".

This is all fine and I'm not really confused about how this works.
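As a sketch, assuming InnoDB's default REPEATABLE READ isolation level and using the question's own `names` table, the two sessions would look like:

```sql
-- Session 1: the long-running transaction
START TRANSACTION;
UPDATE names SET name = 'Jane' WHERE name = 'John';  -- runs for ~100 seconds
-- ...not yet committed...

-- Session 2: meanwhile, a plain SELECT
SELECT name FROM names WHERE id = 31;  -- still returns 'John'

-- Session 1:
COMMIT;

-- Session 2, after the COMMIT:
SELECT name FROM names WHERE id = 31;  -- now returns 'Jane'
```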

Selecting within a transaction

This is the more confusing bit.

I have a table of 100 people named "John" and I start a transaction in which I loop through and select each row one by one. Each select query takes 1 second so this takes 100 seconds.

After 50 seconds another process, not within a transaction, updates every row to "Jane".

In my first process, within the transaction, I will still receive "John" as a result even after the update to "Jane" has completed.

To be clear the timing would be like so:

  • 12:00:00 - All rows say John and a select begins in a transaction that takes 1 second per row
  • 12:00:30 - All rows are updated to Jane
  • 12:00:31 - Row 31 is selected from the first transaction and still returns "John" rather than "Jane".
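The timeline above can be sketched as SQL, again assuming the default REPEATABLE READ level (note that the read snapshot is established at the first consistent read, or explicitly with `START TRANSACTION WITH CONSISTENT SNAPSHOT`):

```sql
-- Session 1, 12:00:00
START TRANSACTION;
SELECT name FROM names WHERE id = 1;   -- 'John'; this first read fixes the snapshot
-- ...one row per second...

-- Session 2, 12:00:30 (autocommit, so this is its own transaction)
UPDATE names SET name = 'Jane';        -- commits immediately

-- Session 1, 12:00:31
SELECT name FROM names WHERE id = 31;  -- still 'John': read from the snapshot
COMMIT;

-- Session 1, after COMMIT
SELECT name FROM names WHERE id = 31;  -- 'Jane' once the transaction is over
```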

How does it work under the hood

So two processes could execute SELECT name FROM names WHERE id = 31 at the exact same time and have one return "John" and the other return "Jane", depending on whether each was in a transaction and when that transaction started.

MySQL must then be storing the value of this field twice in some way.

Does it take a copy?

I don't think it takes a copy of the database or table, since when you begin a transaction it doesn't know which tables you're going to touch. You may not touch a table until 10 minutes into the transaction, and yet the data is as it was 10 minutes ago, no matter how many modifications other processes made in the meantime.

I've also experimented with databases and tables that are GBs in size and take minutes to dump; there's no way it's making entire copies.

Temporary hold somewhere?

Perhaps it temporarily holds the value of the field somewhere waiting for the transaction to finish?

It would then need to check if there's a pending value when performing a select.

Therefore doing SELECT name FROM names WHERE id = 31 would be the equivalent of something like:

// John
if (pending_value('names', 'name', 31)) {
   // Jane
   name = get_pending_value('names', 'name', 31);
} else {
   // John
   name = get_db_value('names', 'name', 31);
}

That is obviously very dumb pseudo code, but it's essentially saying "is there a pending update? If yes, use that instead"

This would presumably be held in memory somewhere? Or perhaps a file? Or one of the system databases?

How does it affect performance

If my names table had 1 billion rows and we performed the same queries then MySQL would simultaneously know that 1 billion rows held the value "John" and that 1 billion rows held the value "Jane". This must surely impact performance.

But is it the query within the transaction that is impacted or the query outside the transaction?

e.g.

  1. Process 1 = Begin transaction
  2. Process 2 = UPDATE names SET name = "Jane"
  3. Process 1 = SELECT name FROM names WHERE id = 31 //John
  4. Process 2 = SELECT name FROM names WHERE id = 31 //Jane

Does the query in step (3) or step (4) have a performance impact or both?


Solution

  • Some clues:

    • Read about "MVCC" -- MultiVersion Concurrency Control
    • A tentatively-changed row is kept until COMMIT or ROLLBACK. (See "history list" in the documentation.) This is row-by-row, not whole table or database. It will not "escalate the row locks to a table lock".
    • Each row of each table has a transaction_id. Each new transaction has a new, higher, id.
    • That xaction id, together with the "transaction isolation mode", determine which copy of each row your transaction can "see". So, yes, there can briefly be multiple "rows" WHERE id = 31.
    • Rows are locked, not tables. In some of your examples, transactions ran for a while, then stumbled over the 'same' row.
    • In some cases, the "gap" between rows is locked. (I did not notice that in your examples.)
    • Whole tables are locked only for DDL (DROP, ALTER, etc.), not DML (SELECT, UPDATE, etc.)
    • When a conflict occurs, a "deadlock" might occur. This is when each transaction is waiting for the other one to release a lock. One transaction is automatically rolled back.
    • When a conflict occurs, a "lock wait" might occur. This is when the transaction with a lock will eventually let go, letting the waiting transaction continue.
    • When a conflict occurs and "lock wait" occurs, innodb_lock_wait_timeout controls how long before giving up.
    • Every statement is inside a transaction. When autocommit=ON, each statement is a transaction unto itself. (Your last example is missing a BEGIN, in which case Process 2 has 2 separate transactions.)
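For instance, the deadlock case in the list above can be provoked by two sessions locking the same two rows in opposite order (a sketch against the question's `names` table, not a guaranteed reproduction):

```sql
-- Session 1:
START TRANSACTION;
UPDATE names SET name = 'A' WHERE id = 1;  -- locks row 1

-- Session 2:
START TRANSACTION;
UPDATE names SET name = 'B' WHERE id = 2;  -- locks row 2
UPDATE names SET name = 'B' WHERE id = 1;  -- blocks, waiting for Session 1

-- Session 1:
UPDATE names SET name = 'A' WHERE id = 2;  -- now each waits on the other;
-- InnoDB detects the cycle and rolls one transaction back with:
-- ERROR 1213 (40001): Deadlock found when trying to get lock;
--   try restarting transaction
```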

    In your first example, the isolation mode of read_uncommitted would let you see the other transaction's changes as they happened. That is a rarely used mode. The other modes won't let you see the changes until they are COMMITted, and it would never see the changes if it were ROLLBACK'd. (Yes, there was a copy of each changed row.)
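Opting in to that rarely used mode is a per-session setting; a minimal sketch:

```sql
-- Session 2 (the reader) opts in to dirty reads:
SET SESSION TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
START TRANSACTION;
SELECT name FROM names WHERE id = 31;  -- may return 'Jane' even before
                                       -- Session 1 has COMMITted
COMMIT;
```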

    repeatable_read mode (and others) effectively limit you to seeing only the rows with your transaction_id or older. Hence, even at 12:00:31, you still see "John".

    General advice:

    • Don't write a transaction that runs longer than a few seconds
    • Remember to use SELECT ... FOR UPDATE where appropriate -- this adds a stronger lock on the rows in the SELECT just in case they will be updated or deleted in the transaction.
    • Where practical, it is better to have one INSERT adding 100 rows; that will be 10 times as fast as 100 single-row INSERTs. (Similarly for UPDATE and DELETE.)
    • Use SHOW ENGINE INNODB STATUS; (I find it useful in dealing with deadlocks, but cryptic for other purposes.)
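The FOR UPDATE and multi-row advice above can be sketched as follows (table and column names are the question's own):

```sql
-- Lock the rows you intend to modify, so no other transaction can
-- change them between the SELECT and the UPDATE:
START TRANSACTION;
SELECT name FROM names WHERE id = 31 FOR UPDATE;
UPDATE names SET name = 'Jane' WHERE id = 31;
COMMIT;

-- One multi-row INSERT instead of many single-row INSERTs:
INSERT INTO names (name) VALUES ('John'), ('Jane'), ('Jim');
```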