I am attempting a mass delete of old data from a huge table with 80,000,000 rows; about 50,000,000 of them need to be removed. The delete will be done in batches of 50k rows to avoid overflowing the database log. Also, the rows of the table are not stored in chronological order. I've come up with the following script:
BEGIN
    DECLARE @START_TIME   DATETIME,
            @END_TIME     DATETIME,
            @DELETE_COUNT NUMERIC(10,0),
            @TOTAL_COUNT  NUMERIC(10,0),
            @TO_DATE      DATETIME,
            @FROM_DATE    DATETIME,
            @TABLE_SIZE   INT

    SELECT @START_TIME = GETDATE()
    PRINT 'Delete script Execution START TIME = %1!', @START_TIME

    SELECT @TABLE_SIZE = COUNT(*) FROM HUGE_TABLE
    PRINT 'Number of rows in HUGE_TABLE = %1!', @TABLE_SIZE

    SELECT @DELETE_COUNT = 1,
           @TOTAL_COUNT  = 0,
           @TO_DATE      = DATEADD(yy, -2, GETDATE())

    CREATE TABLE #TMP_BATCH_FOR_DEL (REQUEST_DT DATETIME)

    WHILE (@DELETE_COUNT > 0)
    BEGIN
        -- Collect the next 50k timestamps older than the current cutoff
        DELETE FROM #TMP_BATCH_FOR_DEL

        INSERT INTO #TMP_BATCH_FOR_DEL (REQUEST_DT)
        SELECT TOP 50000 REQUEST_DT
        FROM HUGE_TABLE
        WHERE REQUEST_DT < @TO_DATE
        ORDER BY REQUEST_DT DESC

        SELECT @FROM_DATE = MIN(REQUEST_DT), @TO_DATE = MAX(REQUEST_DT)
        FROM #TMP_BATCH_FOR_DEL

        PRINT 'Deleting data from %1! to %2!', @FROM_DATE, @TO_DATE

        -- Delete that date window from the big table
        DELETE FROM HUGE_TABLE
        WHERE REQUEST_DT BETWEEN @FROM_DATE AND @TO_DATE

        SELECT @DELETE_COUNT = @@ROWCOUNT
        SELECT @TOTAL_COUNT  = @TOTAL_COUNT + @DELETE_COUNT
        SELECT @TO_DATE      = @FROM_DATE

        COMMIT
        CHECKPOINT
    END

    SELECT @END_TIME = GETDATE()
    PRINT 'Delete script Execution END TIME = %1!', @END_TIME
    PRINT 'Total Rows deleted = %1!', @TOTAL_COUNT

    DROP TABLE #TMP_BATCH_FOR_DEL
END
GO
I did a practice run and found the above was deleting around 2,250,000 rows per hour. At that rate it would take 22+ hours of continuous runtime to delete my data.
I know it's that darn ORDER BY clause within the loop that's slowing things down, but storing the ordered table in another temp table would take up too much memory. But, I can't think of a better way to do this. Thoughts?
It is probably not the query itself. Your code is deleting about 625 records per second (2,250,000 per hour / 3,600 seconds). A lot is going on in that time -- logging, locking, and so on.
A faster approach is to load the data you want to keep into a new table, truncate the old table, and reload it:
select *
into temp_huge_table
from huge_table
where request_dt > ?; -- whatever the cutoff is
Then -- after validating the results -- truncate the huge table and reload the data:
truncate table huge_table;
insert into huge_table
select *
from temp_huge_table;
If there is an identity column, you will need to allow identity inserts (or otherwise preserve its values) when reloading. You might have to take other precautions if there are triggers that set values in the table, or if there are foreign key references to rows in the table.
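For example, something along these lines (a minimal sketch assuming a SQL Server / Sybase ASE style identity_insert option; the column name id is a placeholder, and the full column list must be spelled out while the option is on):

-- sketch only: "id" stands in for your actual identity column
set identity_insert huge_table on;

insert into huge_table (id, request_dt /* , remaining columns... */)
select id, request_dt /* , remaining columns... */
from temp_huge_table;

set identity_insert huge_table off;

Note that on SQL Server a foreign key referencing huge_table will also block the truncate itself, so such constraints may need to be dropped and recreated around the reload.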
I would not suggest reloading straight away, though. After you have truncated the table, you should probably partition it by date -- by day, week, month, whatever.
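For example, on SQL Server a monthly range-partitioned layout would look roughly like this (a sketch only -- the function/scheme names and boundary dates are illustrative, and other platforms such as Sybase ASE use different partitioning syntax):

-- partition function with one boundary per month (extend the list as needed)
create partition function pf_request_dt (datetime)
as range right for values ('2023-01-01', '2023-02-01', '2023-03-01');

create partition scheme ps_request_dt
as partition pf_request_dt all to ([PRIMARY]);

-- place the table on the scheme via its clustered index
create clustered index cix_huge_table_request_dt
on huge_table (request_dt)
on ps_request_dt (request_dt);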
Then, in the future, you can simply drop partitions rather than deleting rows. Dropping partitions is much, much faster.
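On SQL Server 2016 and later, for instance, clearing the oldest partition is a metadata-level operation (again a sketch; the partition number is illustrative):

-- removes every row in partition 1 without row-by-row deletes
truncate table huge_table with (partitions (1));

-- pre-2016 alternative: switch the partition out to an empty, identically
-- structured staging table, then truncate or drop that table
-- alter table huge_table switch partition 1 to huge_table_stage;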
Note that loading a few tens of millions of rows into an empty table is much, much faster than deleting them, but it still takes time (you can test how much time on your system). This requires downtime for the table. However, you hopefully have a maintenance period where this is possible.
And the downtime can be justified by partitioning the table so you won't have this issue in the future.