I have three tables:
grade (grade_id, grade_value, grade_date) ~100M rows
grade_archive (grade_id, grade_value, grade_date) 0 rows
peer_review (grade_id, peer_review_value, peer_review_date) ~10M rows
I want to move all rows from table grade
to grade_archive
that are older than a month and are not in table peer_review
.
The tables are actively used so any insert would have to be low priority to avoid interrupting existing and new processes while this runs.
Expected table rows should look something like this when done:
grade ~10M rows
grade_archive ~90M rows
peer_review ~10M rows
I imagine it's something close to:
INSERT
LOW_PRIORITY
INTO grade_archive
(grade_id,grade_value,grade_date)
SELECT
grade_id,grade_value,grade_date
FROM
grade
WHERE
grade_date < DATE_ADD(NOW(), INTERVAL -1 MONTH)
AND grade_id NOT IN
(
SELECT grade_id FROM peer_review
);
And then clean up the grade
table by removing all rows in the archive table:
DELETE LOW_PRIORITY FROM grade WHERE grade_id IN (SELECT grade_id FROM grade_archive);
But these subselects are very slow with large tables and I'm nervous about the result. Looking for some better direction.
I have had a similar problem in the past with migrating a portion of data from a large active tables into an archive table. The approach I used (modified for your use case) is as follows:
/* Set time for calculation basis */
SET@calc_time = NOW();
/* Create empty copy of grade table */
CREATE TABLE grade_temp LIKE grade;
/* Add rows you want to save from grade into temp table */
INSERT INTO grade_temp
SELECT
g.grade_id AS grade_id,
g.grade_value AS grade_value,
g.grade_date AS grade_date
FROM grade AS g
LEFT JOIN peer_review AS pr
ON g.grade_id = pr.grade_id
WHERE
/*
To keep the record it must either have an entry in peer review
or it is less than a month old
*/
pr.grade_id IS NOT NULL
OR g.grade_date >= DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/*
Switch new temp table for active table.
This happens really fast (it is just file name switching on the system).
*/
RENAME TABLE grade TO grade_old, grade_temp TO grade;
/*
You are now taking new records into new version of grade table
and free to do your much slower operations against the grade_old table
*/
/* Delete more recent rows */
DELETE FROM grade_old
WHERE grade_date >= DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/* Delete rows that exist in peer review */
DELETE FROM grade old
WHERE grade_id IN (
SELECT grade_id
FROM peer_review
WHERE grade_date < DATE_SUB(@calc_time, INTERVAL 1 MONTH)
);
/*
As an alternate to the above action, you could also try deleting across join as shown below. Which is faster will likely depend upon number of records that are returned from that subquery shown above. You can try both out and see what works best
*/
DELETE go FROM grade_old AS go
INNER JOIN peer_review AS pr
ON go.grade_id = pr.grade_id
WHERE pr.grade_date < DATE_SUB(@calc_time, INTERVAL 1 MONTH);
/* Add all rows from grade_old to grade_archive */
INSERT INTO grade_archive
SELECT
grade_id,
grade_value,
grade_date
FROM grade_old;
/* Drop date_old table */
DROP TABLE date_old;
The key here is in getting a new version of grade table which contains only the rows needed in place as quickly as possible and then sort out what goes into the archive table after the fact. You don't want to do any mass delete operations on a table of that size. This keeps the time where you grade table is tied up on these archival operations to a minimum.
I will say however that your database schema seems like it could be optimized for this sort of operation. For example, you could have a peer review flag on your grade table that you could use for quicker filtering rather than having to filter across a join. I am actually questioning the need for that peer review table as a whole unless it has a many to one relationship with grade table (which doesn't seem to be indicated in your question). If there is only ever one peer review entry per grade_id, I would think that these columns should just be normalized into the grade table. That would greatly simplify this maintenance process.