
MySQL ID Rolling


I am working on a rather large database migration to a new database design. The existing structure had multiple tables for the same data being represented for different stores.

For example:

`store1_tickets`
--------------------
| id    | customer |
--------------------
|     1 |       29 |
--------------------

`store2_tickets`
--------------------
| id    | customer |
--------------------
|     1 |       54 |
--------------------

I am now consolidating into a table like this:

`tickets`
----------------------------------------
| id    | legacy_id | store | customer |
----------------------------------------
|     1 |         1 |     1 |       29 |
|     2 |         1 |     2 |       54 |
----------------------------------------

This pattern repeats for several components (customers, vendors, appointments...).

I am writing a (PHP) script to do an ETL into INSERT statements. It has to keep a running total of the new ticket IDs while transforming the data. After each INSERT statement, I generate an UPDATE statement to change the corresponding IDs in other tables (such as changing the customer field in the tickets table once I've re-numbered the customers table).

I'm afraid that running the UPDATEs (after all of the INSERTs) will cause a kind of cascade, where customer 1 gets changed into 54, and then, when the script reaches customer 54, that gets changed into 243, and so on.
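The hazard can be illustrated with a hypothetical pair of statements (the values are invented for the example): run back to back, the second UPDATE also matches the rows the first one just changed.

```sql
-- Renumbering in place: customer 1 becomes 54...
UPDATE tickets SET customer = 54 WHERE customer = 1;

-- ...but this statement now matches BOTH the original 54s
-- AND the rows the previous statement just changed to 54.
UPDATE tickets SET customer = 243 WHERE customer = 54;
```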

How can I properly fix the IDs? The tickets table is the only one keeping the legacy ID, since I will actually be using that as a multi-column auto_increment (each store has to have its own incrementing ticket ID for display purposes). The complication is that so many tables reference each other, which makes updating any of the IDs outright in the script much harder.
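For reference, the multi-column auto_increment mentioned here is a MyISAM-specific MySQL feature: when AUTO_INCREMENT is placed on the second column of a composite index, the counter restarts for each distinct value of the first column. A minimal sketch (table name and columns hypothetical, not the actual schema):

```sql
-- Assumes the MyISAM engine; InnoDB does not support this.
CREATE TABLE per_store_tickets (
    store     INT NOT NULL,
    ticket_no INT NOT NULL AUTO_INCREMENT,
    customer  INT NOT NULL,
    PRIMARY KEY (store, ticket_no)
) ENGINE=MyISAM;

-- Inserting rows for store 1 and store 2 gives each store its own
-- ticket_no sequence starting at 1.
```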

Is there any better approach to this, or some way of preventing the UPDATEs from cascading? I am considering starting the IDs off at a really high number (it would have to be at least 100k given the record counts), and then, after everything is said and done, decrementing all of the IDs by that value.


Solution

  • I decided to go with the "start off with a high ID" method I proposed.

    My script goes something like this (SUPER simplified from the real code).

    <?php
        /* Temp ID offset to add to each legacy ID.
           This MUST be MUCH GREATER than any record counts combined can
           possibly reach. In my case, all rows in my existing 7 tables
           (per data type) totaled about ~350,000, so I rounded up
           generously to be safe.
        */
        define('TEMP_ID', 400000);

        // Accumulators for the generated SQL
        $sql   = '';
        $patch = '';

        // Whole process repeated for each store, and numerous types of data
        $query = $db->query('SELECT * FROM store1_tickets');

        // Insert statement
        $insert_sql = 'INSERT INTO tickets SET id = %1$d, legacy_id = %2$d,
            store = 1, customer = %3$d;';

        // Updates for other tables linked to this ID
        $update_sql  = 'UPDATE logs SET ticket = %1$d WHERE ticket = %2$d;'."\n";
        $update_sql .= 'UPDATE appointments SET ticket = %1$d WHERE ticket = %2$d;'."\n";

        // Counter (kept between sessions for large datasets)
        $ticket_id = 0;

        while($row = $db->fetch_row($query)){

            /* Insert with temp IDs for references to legacy data.
               Note how the legacy ID itself is stored untouched; we are
               only worried about links to other tables.
            */
            $sql .= sprintf($insert_sql, ++$ticket_id, $row['id'],
                $row['customer'] + TEMP_ID);

            /* Now store an update for the tables linked to this ticket,
               changing the old ID (with the temp offset added) to the new one.
            */
            $patch .= sprintf($update_sql, $ticket_id, $row['id'] + TEMP_ID);

        }
    ?>
    

    I then run everything from $sql to enter the raw data, then run the $patch statements to fix all the links.

    Now once I've run all of these queries (stored in a .sql file and dumped in using a batch script), I run a few overall "clean-up" queries to reduce the IDs back down where needed.

    UPDATE logs SET ticket = ticket - 400000 WHERE ticket > 400000;
    UPDATE appointments SET ticket = ticket - 400000 WHERE ticket > 400000;
    UPDATE tickets SET customer = customer - 400000 WHERE customer > 400000;
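
    A quick sanity check after the decrement (a hypothetical verification step, not part of the original batch) is to confirm that no IDs were left stranded in the temporary range:

    ```sql
    -- Each of these should return 0 once the clean-up has run.
    SELECT COUNT(*) FROM logs         WHERE ticket   > 400000;
    SELECT COUNT(*) FROM appointments WHERE ticket   > 400000;
    SELECT COUNT(*) FROM tickets      WHERE customer > 400000;
    ```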
    

    This worked beautifully for me with over 100MB of MySQL data from a few hundred tables. It took quite a bit of time to run the actual queries (I recommend the awesome script BigDump), but it seems to have everything looking good so far.