From the following table:
id_ | p1_id | p2_id | match_date | p1_stat | p2_stat |
---|---|---|---|---|---|
852666 | 1 | 2 | 01/01/1997 | 1301 | 249 |
852842 | 1 | 2 | 13/01/1997 | 2837 | 2441 |
853471 | 2 | 1 | 05/05/1997 | 1474 | 952 |
4760 | 2 | 1 | 25/05/1998 | 1190 | 1486 |
6713 | 2 | 1 | 18/01/1999 | 2084 | 885 |
9365 | 2 | 1 | 01/11/1999 | 2894 | 2040 |
11456 | 1 | 2 | 15/05/2000 | 2358 | 1491 |
13022 | 1 | 2 | 14/08/2000 | 2722 | 2401 |
29159 | 1 | 2 | 26/08/2002 | 431 | 2769 |
44915 | 1 | 2 | 07/10/2002 | 1904 | 482 |
I'd like to return both players' stats for their respective previous matches no matter whether the player in question was p1 or p2 in their last match. See below for the expected output:
id_ | p1_id | p2_id | match_date | p1_stat | p2_stat | p1_prev_stat | p2_prev_stat |
---|---|---|---|---|---|---|---|
852666 | 1 | 2 | 01/01/1997 | 1301 | 249 | NULL | NULL |
852842 | 1 | 2 | 13/01/1997 | 2837 | 2441 | 1301 | 249 |
853471 | 2 | 1 | 05/05/1997 | 1474 | 952 | 2441 | 2837 |
4760 | 2 | 1 | 25/05/1998 | 1190 | 1486 | 1474 | 952 |
6713 | 2 | 1 | 18/01/1999 | 2084 | 885 | 1190 | 1486 |
9365 | 2 | 1 | 01/11/1999 | 2894 | 2040 | 2084 | 885 |
11456 | 1 | 2 | 15/05/2000 | 2358 | 1491 | 2040 | 2894 |
13022 | 1 | 2 | 14/08/2000 | 2722 | 2401 | 2358 | 1491 |
29159 | 1 | 2 | 26/08/2002 | 431 | 2769 | 2722 | 2401 |
44915 | 1 | 2 | 07/10/2002 | 1904 | 482 | 431 | 2769 |
The following SQL works just fine on a table this size:
WITH cte_1 AS (
(
SELECT
id_,
match_date,
p1_id AS player_id,
p1_stat AS stat
FROM
test.match_table
UNION ALL
SELECT
id_,
match_date,
p2_id AS player_id,
p2_stat AS stat
FROM
test.match_table
)
),
cte_2 AS (
SELECT
id_,
player_id,
LAG(stat) OVER (
PARTITION BY player_id
ORDER BY
match_date,
id_
) AS prev_stat
FROM
cte_1
)
SELECT
m.*,
cte_p1.prev_stat AS p1_prev_stat,
cte_p2.prev_stat AS p2_prev_stat
FROM
test.match_table AS m
JOIN cte_2 AS cte_p1 ON cte_p1.id_ = m.id_
AND cte_p1.player_id = m.p1_id
JOIN cte_2 AS cte_p2 ON cte_p2.id_ = m.id_
AND cte_p2.player_id = m.p2_id
ORDER BY m.match_date
The actual table is circa 1.3m rows and if I run the above SQL over that it takes about 35 seconds which isn't bad.
However, my issue is when I want to add a where clause at the very end, e.g.
...
WHERE m.p1_id = 1 or m.p2_id = 1
ORDER BY m.match_date
Running this still takes circa 10 seconds on 1.3m rows when it's only returning 800 or so rows.
Would anyone have any suggestions as to how I might be able to improve this performance?
Here's the SQL to create the small table:
CREATE TABLE `match_table` (
`id_` int NOT NULL AUTO_INCREMENT,
`p1_id` int NOT NULL,
`p2_id` int NOT NULL,
`match_date` date NOT NULL,
`p1_stat` int DEFAULT NULL,
`p2_stat` int DEFAULT NULL,
PRIMARY KEY (`id_`),
KEY `ix__p1_id` (`p1_id`),
KEY `ix__p2_id` (`p2_id`),
KEY `ix__match_date` (`match_date`),
KEY `ix__comp` (`p1_id`, `p2_id`, `match_date`)
);
INSERT INTO `match_table`
VALUES
(
4760, 2, 1, '1998-05-25', 1190, 1486
),
(6713, 2, 1, '1999-01-18', 2084, 885),
(
9365, 2, 1, '1999-11-01', 2894, 2040
),
(
11456, 1, 2, '2000-05-15', 2358, 1491
),
(
13022, 1, 2, '2000-08-14', 2722, 2401
),
(
29159, 1, 2, '2002-08-26', 431, 2769
),
(
44915, 1, 2, '2002-10-07', 1904, 482
),
(
852666, 1, 2, '1997-01-01', 1301, 249
),
(
852842, 1, 2, '1997-01-13', 2837, 2441
),
(
853471, 2, 1, '1997-05-05', 1474, 952
);
The original query takes a long time because it processes the entire table before filtering the results. When you add the WHERE
clause at the end, the entire table is still processed, so the performance doesn't improve.
To make it faster, you should filter the data as early as possible. In this case, include the WHERE
clause right at the start when you're selecting from test.match_table
. This way, only relevant records (where p1_id or p2_id is 1)
are processed further, reducing the workload.
Remember, different strategies might work differently depending on the database system, table structure, indexing, and data distribution. It's always good to try a few approaches and see what works best.
Here's a simplified version of your query, moving the filtering condition to an earlier stage:
WITH cte_1 AS (
(
SELECT
id_,
match_date,
p1_id AS player_id,
p1_stat AS stat
FROM
test.match_table
WHERE
p1_id = 1
UNION ALL
SELECT
id_,
match_date,
p2_id AS player_id,
p2_stat AS stat
FROM
test.match_table
WHERE
p2_id = 1
)
),
cte_2 AS (
SELECT
id_,
player_id,
LAG(stat) OVER (
PARTITION BY player_id
ORDER BY
match_date,
id_
) AS prev_stat
FROM
cte_1
)
SELECT
m.*,
cte_p1.prev_stat AS p1_prev_stat,
cte_p2.prev_stat AS p2_prev_stat
FROM
test.match_table AS m
JOIN cte_2 AS cte_p1 ON cte_p1.id_ = m.id_ AND cte_p1.player_id = m.p1_id
JOIN cte_2 AS cte_p2 ON cte_p2.id_ = m.id_ AND cte_p2.player_id = m.p2_id
WHERE m.p1_id = 1 or m.p2_id = 1
ORDER BY m.match_date
This query adds WHERE p1_id = 1
and WHERE p2_id = 1
in the initial table scans, reducing the number of rows to be processed further.