I'm having problems with a slow SQL query running on the following system:
The table schemas and some sample data (no DB Fiddle as it doesn't support MariaDB):
DROP TABLE IF EXISTS item_prices;
DROP TABLE IF EXISTS prices;
DROP TABLE IF EXISTS item_orders;
CREATE TABLE item_orders
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
ordered_date DATE NOT NULL
) Engine=InnoDB;
CREATE TABLE prices
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
selected_flag TINYINT UNSIGNED NOT NULL
) Engine=InnoDB;
CREATE TABLE item_prices
(
item_order_id INT UNSIGNED NOT NULL,
price_id INT UNSIGNED NOT NULL,
PRIMARY KEY (item_order_id, price_id),
FOREIGN KEY (item_order_id) REFERENCES item_orders(id),
FOREIGN KEY (price_id) REFERENCES prices(id)
) Engine=InnoDB;
INSERT INTO item_orders VALUES (1, '2022-01-01');
INSERT INTO item_orders VALUES (2, '2022-02-01');
INSERT INTO item_orders VALUES (3, '2022-03-01');
INSERT INTO prices VALUES (1, 0);
INSERT INTO prices VALUES (2, 0);
INSERT INTO prices VALUES (3, 1);
INSERT INTO prices VALUES (4, 0);
INSERT INTO prices VALUES (5, 0);
INSERT INTO prices VALUES (6, 1);
INSERT INTO item_prices VALUES (1, 1);
INSERT INTO item_prices VALUES (1, 2);
INSERT INTO item_prices VALUES (1, 3);
INSERT INTO item_prices VALUES (2, 4);
INSERT INTO item_prices VALUES (2, 5);
INSERT INTO item_prices VALUES (3, 6);
A high-level overview of the table usage is:
What I want to get, in a single query, are:
At the moment I have the following query:
SELECT
COUNT(item_orders.id) AS item_order_count,
SUM(CASE WHEN prices.id IS NOT NULL THEN 1 ELSE 0 END) AS item_order_selected_count
FROM
item_orders
LEFT JOIN prices ON prices.id IN (
SELECT price_id
FROM item_prices
WHERE
item_prices.item_order_id = item_orders.id)
AND prices.selected_flag = 1
This query returns the correct data (item_order_count = 3, item_order_selected_count = 2), however it takes a long time (over 10 seconds) to run on a live dataset, which is too slow for users (it is a heavily-used report, refreshed repeatedly through the day). I think the problem is the subquery in the LEFT JOIN, as removing the LEFT JOIN and the associated SUM reduces the query time to around 0.1 seconds. Also, the EXPLAIN output for the join has this in the Extra column:
Using where; Using join buffer (flat, BNL join)
Searching for 'flat BNL join' reveals a lot of information, of which the summary seems to be: 'BNL joins are slow, avoid them if you can'.
Is it possible to rewrite this query to return the same information, but avoiding the BNL join?
Things I've considered already:
All the ID columns are indexed (item_orders.id, prices.id, item_prices.item_order_id, item_prices.price_id).
Splitting the query in two - one for item_order_count (no JOIN), the other for item_order_selected_count (INNER JOIN, as I only need rows which match). This works but isn't ideal as I want to build up this query to return more data (I've stripped it back to the minimum for this question). Also, I'm trying to keep the query output as close as possible to what the user will see, as that makes debugging easier and makes the database (which is optimised for that workload) do the work, rather than the application.
Changing the MariaDB configuration: Some of the search results for BNL joins suggest changing configuration options, however I'm wary of doing this as there are hundreds of other queries in the application and I don't want to cause a regression (e.g. speed up this query but accidentally slow down all the others).
Upgrading MariaDB: This would be a last resort as it would involve using a version different to that packaged with Debian, might break other parts of the application, and the system has just been through a major upgrade.
I came back to this question this week as the performance got even worse as the number of rows increased, to the point where it was taking over 2 minutes to run the query (with around 100,000 rows in the item_orders table, so hardly 'big data').
I remembered that it was possible to list multiple tables in the FROM clause and wondered if the same was true of a LEFT JOIN. It turns out this is the case and the query can be rewritten as:
SELECT
COUNT(item_orders.id) AS item_order_count,
SUM(CASE WHEN prices.id IS NOT NULL THEN 1 ELSE 0 END) AS item_order_selected_count
FROM
item_orders
LEFT JOIN (item_prices, prices) ON
item_prices.item_order_id = item_orders.id
AND prices.id = item_prices.price_id
AND prices.selected_flag = 1
This returns the same results but takes less than a second to execute. Unfortunately I don't know any relational algebra to prove this, but effectively what I am saying is 'only LEFT JOIN where everything matches on both item_prices and prices'.