a is a table of 5,000,000 records and a score. b is a table of 100,000 records and a (different) score. I want to output every record from table a as long as the same record is not in table b with a score equal to 1.
The following query I wrote has taken 15 hours to execute and is still running. Is it possible that it could take this long to do such a query or is there perhaps something wrong with my code?
SELECT a.c1, a.c2, a.score
INTO OUTFILE 'file.csv'
FROM a
LEFT OUTER JOIN b
ON a.c1=b.c1
AND a.c2=b.c2
WHERE b.score <> 1
Your query retrieves all records from table a that are also in table b with a score not equal to 1 (and it retrieves those records as many times as they appear in table b with a score other than 1). Which is different from "every record from table a as long as the same record is not in table b with a score equal to 1".
SELECT a.c1, a.c2, a.score
INTO OUTFILE 'file.csv'
FROM a
WHERE NOT EXISTS (
SELECT 1 FROM b
WHERE a.c1 = b.c1
AND a.c2 = b.c2
AND b.score = 1
)
The query above can take advantage of the following index
create index my_idx on b(c1,c2,score)