mysql database normalization database-normalization

Efficient MySQL Queries on a Normalized Table with Complex and/or Conditions

I have the following table:

create table x (
    id integer, 
    property integer
);

I need to efficiently run queries that test multiple property values for a given id. For instance, I may want a query that gets all ids with a property satisfying the condition: 1 and not (2 or 3 or 4 or 5).

If all my properties were in boolean columns ('t1' = true if property value 1 exists for id, false otherwise, etc...), then I can simply run this very fast query (assume y were such a table):

select * from y where t1 and not (t2 or t3 or t4 or t5);

Unfortunately, I have thousands of properties, so this won't do. Furthermore, there's no rhyme or reason as to the queries, so while I can bundle groups of properties into conceptual relations, the query boundaries don't respect that. Additionally, these queries are (indirectly) determined by the user, so creating views in anticipation of them won't help. Finally, we'll constantly be adding data sets with new properties whose relations may be new, vague or cross-cutting, meaning that trying to create tables of relations may become a maintenance nightmare.

Hence why I chose the original schema.

To try to accomplish my queries, I tried first creating a pivot on the fields involved in the query, then querying that:

create table pivot as (
    select 
        id,
        max(if(property=1,true,false)) as 't1',
        max(if(property=2,true,false)) as 't2',
        max(if(property=3,true,false)) as 't3',
        max(if(property=4,true,false)) as 't4',
        max(if(property=5,true,false)) as 't5'
    from x);
select * from pivot where t1 and not (t2 or t3 or t4 or t5);

However, this is very slow. In fact, it's slower than an un-optimized home-brewed solution.

I know I can produce complex queries with sub-queries, but a limited test suggested that performance would be even worse (unless I structured the query incorrectly).

What can I do to speed up my queries?

Solution

I assume that id is not unique and an existing record (some_id, property_id) means that the property is on.

First, I would notice that I need to efficiently run queries that test multiple property values for a given id and I may want a query that gets all ids with a property satisfying the condition: 1 and not (2 or 3 or 4 or 5) may lead to completely different queries.

But here is my idea. Some more assumptions:

there is always a positive condition that cuts away a significant part of ids (otherwise I would join several properties into an extra one that becomes such a condition)
any pair (id, property) is unique (an you even have the corresponding UNIQUE index)

Now, if you have an index on (property, id), then the following query will take all matching ids from the covering index (that is quickly):

SELECT id
FROM t1 
WHERE property = 150;

If this query leads to a significantly smaller result set than the whole table, you can afford to make another fast correlated subquery for another property that will significantly decrease the result set. This subquery will require another covering index (id, property) and the corresponding UNIQUE index is what it needs:

SELECT id
FROM t1 
WHERE property = 150 
  AND NOT EXISTS (
    SELECT 1
    FROM t2
    WHERE t2.id = t1.id AND t2.property = 130
  )
  AND NOT EXISTS (
    SELECT 1
    FROM t2
    WHERE t2.id = t1.id AND t2.property = 90
  );

If an earlier correlated subquery results to false, all the following subqueries will not be executed for the row. That is why the order is crucial.

You will need to play with the properties order, and probably hardcode that in the code that executes the query.

UPD: then, I afraid, you do not have much choice. The best you can do is to walk through the index in a single pass and compute what you need. The speed of the query then will mainly depend on the number of rows in your table. So, again assuming you have the UNIQUE index on (id, property), you can write something like:

SELECT id
FROM t1
GROUP BY id
HAVING 
  COUNT(IF(property=150, 1, NULL))
  AND NOT COUNT(IF(property=130, 1, NULL))
  AND NOT COUNT(IF(property=90, 1, NULL));