I have 2 tables, one of them student name has 35 M rows, and another table (name is index) has 1500 rows.
When I use this query:
SELECT DISTINCT family
FROM [DBName].[dbo].[student]
it takes about 3 seconds to run and return results (given that it has 35 million rows, but soon return the result).
But when I run
DELETE FROM [DBName].[dbo].[index]
WHERE family NOT IN (SELECT DISTINCT family FROM [DBName].[dbo].[student])
that's very slow and I waited for thirty minutes but it was not finished yet.
Why?
This is an apples to oranges comparison, since a DELETE
is writing to the transaction log and SELECT
is not.
The DISTINCT
works by virtue of sorting the output rows in order to find duplicates. This will be made really slow if you are missing an index (for what it's worth 3 seconds seems slow to me for output). Even slower when performing a NOT IN
on the same column. This doesn't take into consideration any constraints you might have on that column, which could further slow it down.
You can likely make this faster using GROUP BY
.
DELETE
FROM [DBName].[dbo].[index]
WHERE family NOT IN (
SELECT family
FROM [DBName].[dbo].[student]
GROUP BY family)
Likely faster still using NOT EXISTS
instead of NOT IN
:
DELETE
FROM [DBName].[dbo].[index] AS outer
WHERE family NOT EXISTS (
SELECT family
FROM [DBName].[dbo].[student] AS inner
WHERE inner.family = outer.family
GROUP BY family)