Search code examples
sqldatabasemariadbquery-optimizationgreatest-n-per-group

MariaDB Query is way too slow


today I ran into a performance issue with a MariaDB query for this scenario: I have a table called stations where gas stations are saved (around 17k entries) with post_code, uuid, name, and other info that aren't used in this case. Then there's another table called prices (~210Mio entries) where a single entry represents a single price change for a single station with its specific date time and station_uuid.

First I needed to retrieve the min price for diesel in a 30-minute interval of a single day. This works fine. But then I needed to add the corresponding stations' names of the station the price change happened. So I added the specific join Logic but then remembered that it's not guaranteed to use the correct station name when grouping by time. So I tried to use a subquery but that's resulting in a timeout error.

I'm using single-column BTREE indexes on prices.diesel, prices.date, stations.post_code.

Here are my so far tried SQL Queries:

  1. Without name:
SELECT 
  MIN(diesel) AS "Minimal",
  date - interval minute(date)%30 minute AS time
FROM prices 
LEFT JOIN stations
  ON prices.station_uuid = stations.uuid
WHERE `date` BETWEEN (SELECT FROM_UNIXTIME(1684627200)) AND (SELECT FROM_UNIXTIME(1684713599)) 
    AND diesel > 0.0 
    AND post_code = 59929
GROUP BY date_format(time, '%Y-%m-%d %H:%i');
  1. With name:
SELECT 
  MIN(diesel) AS "Minimal",
  name AS "Tankstelle"
  date - interval minute(date)%30 minute AS time
FROM prices 
LEFT JOIN stations
  ON prices.station_uuid = stations.uuid
WHERE `date` BETWEEN (SELECT FROM_UNIXTIME(1684627200)) AND (SELECT FROM_UNIXTIME(1684713599)) 
    AND diesel > 0.0 
    AND post_code = 33106
GROUP BY date_format(time, '%Y-%m-%d %H:%i');
  1. With Subquery:
SELECT 
  stations.name AS "Tankstelle",
  date - interval minute(date)%30 minute as time
FROM prices 
LEFT JOIN stations
  ON prices.station_uuid = stations.uuid
WHERE `date` BETWEEN (SELECT FROM_UNIXTIME(1684627200)) AND (SELECT FROM_UNIXTIME(1684713599)) 
    AND diesel > 0.0 
    AND post_code = 59929
    AND diesel = (
      SELECT MIN(diesel) as Min 
      FROM prices p
      WHERE p.date BETWEEN (prices.date - interval minute(prices.date)%30 minute) AND (prices.date - interval ((minute(prices.date)%30)+30) minute))

EDIT to answer first three comments (23.05.2023):

MariaDB version: 10.11.3-MariaDB-1:10.11.3+maria~ubu2204

"diesel" is the price of diesel but if it is a change for e10 then the diesel will be 0 so to avoid using those rows I filter for 0 values

May sound a bit ignorant, but I don't know exactly how "with ties" works, so I don't have an opinion yet. And the clustered key is the primary key, isn't it? If yes it should be answered with the table definitions. If not then it sounds even more ignorant since you would need to explain to me how it differs from the primary key.

Thanks for the help in advance.

The result of EXPLAIN is the following:

Query 1:

+------+-------------+----------+--------+-------------------+---------+---------+------------------------------+--------+---------------------------------------------------------------------+
| id   | select_type | table    | type   | possible_keys     | key     | key_len | ref                          | rows   | Extra                                                               |
+------+-------------+----------+--------+-------------------+---------+---------+------------------------------+--------+---------------------------------------------------------------------+
|    1 | SIMPLE      | prices   | range  | date,diesel       | date    | 5       | NULL                         | 695600 | Using index condition; Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | stations | eq_ref | PRIMARY,post_code | PRIMARY | 16      | tankonix.prices.station_uuid | 1      | Using where                                                         |
+------+-------------+----------+--------+-------------------+---------+---------+------------------------------+--------+---------------------------------------------------------------------+

Query 2:

+------+-------------+----------+--------+-------------------+---------+---------+------------------------------+--------+---------------------------------------------------------------------+
| id   | select_type | table    | type   | possible_keys     | key     | key_len | ref                          | rows   | Extra                                                               |
+------+-------------+----------+--------+-------------------+---------+---------+------------------------------+--------+---------------------------------------------------------------------+
|    1 | SIMPLE      | prices   | range  | date,diesel       | date    | 5       | NULL                         | 695600 | Using index condition; Using where; Using temporary; Using filesort |
|    1 | SIMPLE      | stations | eq_ref | PRIMARY,post_code | PRIMARY | 16      | tankonix.prices.station_uuid | 1      | Using where                                                         |
+------+-------------+----------+--------+-------------------+---------+---------+------------------------------+--------+---------------------------------------------------------------------+

Query 3: explain result

+------+--------------------+----------+--------+-------------------+---------+---------+------------------------------+-----------+------------------------------------+
| id   | select_type        | table    | type   | possible_keys     | key     | key_len | ref                          | rows      | Extra                              |
+------+--------------------+----------+--------+-------------------+---------+---------+------------------------------+-----------+------------------------------------+
|    1 | PRIMARY            | prices   | range  | date,diesel       | date    | 5       | NULL                         | 695600    | Using index condition; Using where |
|    1 | PRIMARY            | stations | eq_ref | PRIMARY,post_code | PRIMARY | 16      | tankonix.prices.station_uuid | 1         | Using where                        |
|    4 | DEPENDENT SUBQUERY | p        | ALL    | date              | NULL    | NULL    | NULL                         | 207323157 | Using where                        |
+------+--------------------+----------+--------+-------------------+---------+---------+------------------------------+-----------+------------------------------------+

My Table Definitions:

prices: table prices indexes prices

+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table  | Create Table



                                                     |
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| prices | CREATE TABLE `prices` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `date` datetime NOT NULL,
  `station_uuid` text NOT NULL,
  `diesel` float NOT NULL,
  `e10` float NOT NULL,
  `e5` float NOT NULL,
  `dieselchange` tinyint(4) NOT NULL,
  `e5change` tinyint(4) NOT NULL,
  `e10change` tinyint(4) NOT NULL,
  PRIMARY KEY (`id`),
  KEY `date` (`date`) USING HASH,
  KEY `e5` (`e5`),
  KEY `e10` (`e10`),
  KEY `diesel` (`diesel`)
) ENGINE=InnoDB AUTO_INCREMENT=381367318 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci |
+--------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

stations: table stations indexes stations

+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Table    | Create Table




                    |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| stations | CREATE TABLE `stations` (
  `uuid` uuid NOT NULL,
  `name` varchar(127) NOT NULL,
  `brand` varchar(127) DEFAULT NULL,
  `street` varchar(127) NOT NULL,
  `house_number` varchar(7) NOT NULL DEFAULT '',
  `post_code` varchar(5) NOT NULL,
  `city` varchar(31) NOT NULL,
  `latitude` double NOT NULL,
  `longitude` double NOT NULL,
  `openingtimes_json` longtext CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL CHECK (json_valid(`openingtimes_json`)),
  `first_active` date NOT NULL,
  PRIMARY KEY (`uuid`),
  KEY `post_code` (`post_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_general_ci |
+----------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

EDIT (09.06.2023):

I succeeded in Optimizing it enough. First I completed the steps mentioned in the marked answer, then it went much better but not fast enough so I proceeded to try different approaches and succeeded with:

SELECT 
  stations.name AS "Tankstelle",
  date - interval minute(date)%30 minute as time_begin,
  date - interval minute(date)%30 - 30 minute as time_end
FROM stations 
JOIN prices
  ON prices.station_uuid = stations.uuid
JOIN (SELECT MIN(diesel) as price,
        date_format(date - interval minute(date)%30 minute, "%Y-%m-%d %H:%i:59") as time
      FROM prices p
      JOIN stations
        ON stations.uuid = p.station_uuid
      WHERE p.date BETWEEN FROM_UNIXTIME(1684627200) AND FROM_UNIXTIME(1684713599)
        AND diesel > 0
        AND stations.post_code = 59929
      GROUP BY date_format(time, '%Y-%m-%d %H:%i')) as min
    ON min.time Between (date - interval minute(date)%30 minute) and (date - interval minute(date)%30 - 30 minute)
WHERE `date` BETWEEN FROM_UNIXTIME(1684627200) AND FROM_UNIXTIME(1684713599)
    AND diesel > 0.0 
    AND post_code = 59929;

like this it gives me the name of the station corresponding to the cheapest diesel price in the given city grouped for each half an hour.


Solution

  • BETWEEN  (  SELECT  FROM_UNIXTIME(1684627200) )  AND ...
    

    -->

    BETWEEN FROM_UNIXTIME(1684627200) AND ...
    

    That is, simply use the expression; you don't need to SELECT it.

    Since the WHERE clause tests columns from different tables, the Optimizer will pick one of the tables (p vs s) to start with. It may goof.

    Since there is a test on s.PostCode in the WHERE, the LEFT JOIN is really a JOIN. Please change that to help humans read the query. Being a JOIN opens the possibility of starting with s instead of being forced to start with p.

                FROM  prices AS p
                JOIN  stations AS s  ON s.uuid = p.station_uuid
                WHERE  p.date BETWEEN FROM_UNIXTIME(1684627200)
                                  AND FROM_UNIXTIME(1684713599)
                  AND  p.diesel > 0.0
                  AND  s.post_code = 1 
    

    Starting with s:

    s: (you effectively have this)
       INDEX(post_code,       -- because of =
             uuid)            -- for "covering"
    p: (you have this)
       INDEX(station_uuid,    -- =
             date,            -- assuming it is more selective than diesel
             diesel)          -- covering
    

    Starting with p:

    p: INDEX(date, diesel, station_uuid)  -- covering, but otherwise unexciting
    s: the PRIMARY KEY(uuid) will be used effectively.
    

    Recommendation (after changing to JOIN): the second p index.

    Notes:

    • The Optimizer will pick between starting with s and p depending on crude statistics it has.
    • Each secondary index implicitly includes a copy of the PRIMARY KEY (so it can get to the rest of the rows). Hence (post_code, uuid) is the same as (post_code), which you already have.
    • I have not found that OVER(PARTITION BY...) has any magical performance, so I worry that that will be the slowest part of the query.
    • My Groupwise-Max found that it is not the fastest, but does avoid showing duplicates, which the Uncorrelated Subquery suffers from.