I am trying to optimise my SQL query for use in a big DB2 table with 10,000s of records. I need to extract rows from Table 1, where the date is older than the rows in Table 2. Table 2 can have many rows that are more recent but I only care about the Table 1 data. I have a query but it's really slow and I am looking to optimise it further. I have gone for this approach:
SELECT * FROM TABLE1 t1
WHERE t1.START_DATE < (
SELECT MAX(DATE(t2.TIMESTAMP_DATE))
FROM TABLE2 t2
WHERE t1.NAME = t2.NAME
);
Here is an example dataset which runs fine with low volumes but my big dataset is running too slow...
Table 1:
NAME | START_DATE |
---|---|
JOHN | 2018-01-01 |
JIM | 2019-01-01 |
ALEX | 2023-01-01 |
Table 2:
NAME | TIMESTAMP_DATE |
---|---|
JOHN | 2022-02-22-09.20.23.891000 |
JOHN | 2022-03-22-09.20.23.891000 |
JOHN | 2022-04-22-09.20.23.891000 |
JIM | 2022-01-22-09.20.23.891000 |
JIM | 2022-02-22-09.20.23.891000 |
ALEX | 2022-05-01-09.20.23.891000 |
So from my query, I want to extract the follow dataset:
NAME | START_DATE |
---|---|
JOHN | 2018-01-01 |
JIM | 2019-01-01 |
Could I please get any tips to optimise for a massive dataset in both tables?
Building a list of names and max dates from T2 and joining is more efficient, try this :
select
t1.*
from table1 t1
inner join (
select
name,
date(max(timestamp_date)) maxdate
from table2 t2
group by name
) b on b.name = t1.name
where t1.start_date < b.maxdate