Search code examples
sqlmariadbpivot-tablequery-performance

MariaDB Pivot Table Performace


I have a table containing data with dynamic categories:

+----------+--------------+---------------+---------+
| category | string_value | integer_value | user_id |
+----------+--------------+---------------+---------+
| cat_1    | NULL         | 1             |       1 |
| cat_1    | NULL         | 3             |       2 |
| cat_2    | foo          | NULL          |       1 |
| cat_2    | bar          | NULL          |       2 |
+----------+--------------+---------------+---------+

I need a pivoted version of this table which I use with statement:

select
  user_id,
  max(case when category == 'cat_1' then integer_value end) as 'cat_1',
  max(case when category == 'cat_2' then string_value end) as 'cat_2',
from my_table
group by user_id

This creates results in the following format:

+---------+-------+-------+
| user_id | cat_1 | cat_2 |
+---------+-------+-------+
|       1 |     1 | foo   |
|       2 |     3 | bar   |
+---------+-------+-------+

This query by itself is performing well also for many categories and table entries (e.g., for 8 categories and 240k entries, it takes around 20ms), but if i wrap this exact query in a select * from <query>, performance drops significantly (to 650ms).

Also, ordering by user_id does not influence the performance significantly, whereas ordering by any other field also causes a performance drop, even if an index of the respective field and user_id exists. I'm guessing that this approach in itself is not feasible for larger tables? However, I am curious what causes the extra execution time when adding the select * from <query part.

Background: I try to use this query to store dynamic user data, and I'd like to prevent changes to the table structure at runtime (i.e., adding a column). Any alternative would be welcome. I'm using MariaDB 10.5.5, and I need the solution to also work with MySQL 5.7 and SQL Server 2019.

Execution Plans:

Without surrounding select * from:

+----+-------------+-----------+-------+---------------+------------+---------+-----+--------+---------+----------+------------+-------+    
| id | select_type | table     | type  | possible_keys | key        | key_len | ref | rows   | r_rows  | filtered | r_filtered | Extra |
|----|-------------|-----------|-------|---------------|------------|---------|-----|--------|---------|----------|------------|-------|
|  1 | SIMPLE      | user_data | index |               | user_index |         |   9 | 226067 | 1619.00 |    100.0 |      99.88 |       |
+----+-------------+-----------+-------+---------------+------------+---------+-----+--------+---------+----------+------------+-------+

With surrounding select * from:

+----+-------------+------------+-------+---------------+------------+---------+-----+--------+-----------+----------+------------+-------+ 
| id | select_type | table      | type  | possible_keys | key        | key_len | ref | rows   | r_rows    | filtered | r_filtered | Extra |
|----|-------------|------------|-------|---------------|------------|---------|-----|--------|-----------|----------|------------|-------|
|  1 | PRIMARY     | <derived2> | ALL   |               |            |         |     | 226067 |    200.00 |    100.0 |      100.0 |       |
|  2 | DERIVED     | user_data  | index |               | user_index |       9 |     | 226067 | 242418.00 |    100.0 |      100.0 |       |
+----+-------------+------------+-------+---------------+------------+---------+-----+--------+-----------+----------+------------+-------+ 

Solution

  • Here is my speculation on what is happening.

    You have an index on the underlying table which MariaDB uses for the aggregation. That means that no sorting is done . . . and just by reading the index, it can start returning rows.

    This is a very nice feature. But when you just run the query, you are seeing the time to first row.

    When you use a derived table, MariaDB must generate all the rows before returning any of them. So, the select * with a subquery is doing much more work.

    That is why the second version is slower than the first. I would expect a query that returns many tens of thousands of rows to take more than 20 milliseconds on most machines.