Search code examples
mysqlsqlperformancesql-optimizationsqlperformance

Optimizing DISTINCT SQL query with OR conditions


I have the following SQL query:

SELECT DISTINCT business_key
FROM Memory
WHERE concept <> 'case' OR attrib <> 'status' OR value <> 'closed'

What I try to achieve is to get all unique business keys that don't have a record concept=case AND attrib=status AND value=closed. Running this query in MySQL with 500 000 records with all unique business_keys is very slow: about 11 seconds.

I placed indices to the business_key column, to the concept, attrib and value columns. I also tried with a combined index to all three columns (concept, attrib, value) but the result is the same.

Here is a screenshot of the EXPLAIN EXTENDED command:

enter image description here

The interesting thing is that running the query without the distinct specifier results in a very fast execution.

I had also tried this:

SELECT DISTINCT m.business_key
FROM Memory m 
WHERE m.business_key NOT IN 
(SELECT c.business_Key 
 FROM Memory c 
 WHERE c.concept = 'case' AND c.attrib = 'status' AND c.value = 'closed')

with even worse results: around 25 seconds


Solution

  • You could add a compound (concept, attrib, value, business_key) index so the query (if MySQL decides to use this index) can find all the info in the index without having to read the whole table.

    Your query is equivalent to:

    SELECT DISTINCT business_key
    FROM Memory
    WHERE NOT (concept = 'case' AND attrib = 'status' AND value = 'closed')
    

    and to this (which will probably yield the same execution plan):

    SELECT business_key
    FROM Memory
    WHERE NOT (concept = 'case' AND attrib = 'status' AND value = 'closed')
    GROUP BY business_key
    

    Since the 4 columns that are to be put in the index are all VARCHAR(255), the index length will be pretty large. MyISAM will not allow more than 1000 bytes and InnoDB no more than 3072.

    One solution is to cut the length of the last part, making the index length less than 1000: 255+255+255+230 = 995:

    (concept, attrib, value, business_key(220))

    It will work but it's really not good to have so large index lengths, performance wise.

    Another option is to lower the length of all or some of those 4 columns, if that complies with the data you expect to store there. No need to declare length 255 if you expect to have maximum of 100 in a column.

    Another option you may consider is putting those 4 columns in 4 separate reference tables. (Or just the columns that have repeated data. It seems that business_key will have duplicate data but not that many. So, it won't be much good to make a reference table for that column.)

    Example: Put concept values in a new table with something like:

    CREATE TABLE Concept_Ref
    ( concept_id INT AUTO_INCREMENT
    , concept VARCHAR(255)
    , PRIMARY KEY concept_id
    , UNIQUE INDEX concept_idx (concept) 
    ) ;
    
    INSERT INTO Concept_Ref
      ( concept )
    SELECT DISTINCT
        concept
    FROM
        Memory ;
    

    and then change the Memory table with:

    ALTER TABLE Memory
    ADD COLUMN concept_id INT ;
    

    do this (once):

    UPDATE 
        Memory m
      JOIN
        Concept_Ref c
          ON c.concept = m.concept
    SET m.concept_id = c.concept_id
    

    and then drop the Memory.concept column:

    ALTER TABLE Memory
    DROP COLUMN concept ;
    

    You can also add FOREIGN KEY references if you change your tables from MyISAM to InnoDB.

    After doing the same for all 4 columns, not only the length of new compound index in the Memory table will be much smaller but your table size will be much smaller, too. Additionally, any other index that uses any of those columns will have smaller length.

    Off course the query would need 4 JOINs to be written. And any INSERT, UPDATE or DELETE statement to this table will have to be changed and carefully designed.

    But overall, I think you will have better performance. With the design you have now, it seems that values like 'case', 'status' and 'closed' are repeated many times.