I've read the others related questions but mine is unique because of it's structure.
My App is storing about 10,000+ users with profiles defined by many parameters (gender, weight, height, hair color, eye color, dancing skills... and so on, about 100 attributes, lets say).
The application construct a Filter Form with these attributes. User is filtering database using this form so constructs a Query with many sub-queries, one for each filter used.
The problem is that using more than 8-9 filters, the engine will crash into a very long response (I had to kill the process after 30m of waiting).
So, this is the structure of Database
Table def_attributes (here are the attributes definitions)
Table utilizatori (users definitions, only column activ is used now)
Table val_atribute (storing the values of attributes for each user)
For example, here is a query constructed by the Filtering form, that lags out:
SELECT DISTINCT Q1.user_id
FROM (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 45
AND attr_value IN ( 'Actor', 'Actor Amator' )) Q1
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 46
AND Floor(Datediff(Curdate(), attr_value) / 365) >= '20') Q2
ON Q1.user_id = Q2.user_id
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 46
AND Floor(Datediff(Curdate(), attr_value) / 365) <= '50') Q3
ON Q2.user_id = Q3.user_id
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 47
AND attr_value IN ( 'feminin', 'masculin' )) Q4
ON Q3.user_id = Q4.user_id
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 102
AND attr_value IN ( 'African', 'Asiatic', 'Caucazian', 'Metis' )) Q5
ON Q4.user_id = Q5.user_id
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 103
AND attr_value >= 1) Q6
ON Q5.user_id = Q6.user_id
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 103
AND attr_value <= 200) Q7
ON Q6.user_id = Q7.user_id
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 104
AND attr_value >= 10) Q8
ON Q7.user_id = Q8.user_id
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 104
AND attr_value <= 150) Q9
ON Q8.user_id = Q9.user_id
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 107
AND attr_value IN ( 'Albastri', 'Caprui', 'Heterocrom', 'Verzi' )) Q10
ON Q9.user_id = Q10.user_id
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 108
AND attr_value IN ( 'Blond', 'Brunet', 'Castaniu', 'Roscat', 'Saten' )) Q11
ON Q10.user_id = Q11.user_id
INNER JOIN (SELECT DISTINCT val_atribute.user_id
FROM val_atribute
WHERE attr_id = 109
AND attr_value IN ( 'Calvitie', 'Lung', 'Mediu', 'Scurt', 'Zero' )) Q12
ON Q11.user_id = Q12.user_id
INNER JOIN (SELECT DISTINCT utilizatori.id
FROM utilizatori
WHERE activ = 1) Q13
ON Q12.user_id = Q13.id
GROUP BY user_id
Q2 is calculating AGE becuase weonly have attribute [Date of birth] and filter Q2 wants age > 20.
The last Query (here Q13) is always mathing active user from Table utilizatori.
I think is a problem with cartesian progression but Question: how can I remake the query to make it way more faster? Thank you very much!
Edit / Problem solved:
Following Gordon Linoff's great help, I've constructed the correct query, using same filters:
SELECT u.id
FROM utilizatori u
WHERE EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 45
AND attr_value IN ( 'Actor', 'Actor Amator' ))
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 46
AND Floor(Datediff(Curdate(), attr_value) / 365) >= 20)
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 46
AND Floor(Datediff(Curdate(), attr_value) / 365) <= 50)
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 47
AND attr_value IN ( 'feminin', 'masculin' ))
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 102
AND attr_value IN ( 'African', 'Asiatic', 'Caucazian', 'Metis' ))
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 103
AND attr_value >= 1)
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 103
AND attr_value <= 200)
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 104
AND attr_value >= 10)
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 104
AND attr_value <= 150)
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 107
AND attr_value IN ( 'Albastri', 'Caprui', 'Heterocrom', 'Verzi' ))
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 108
AND attr_value IN ( 'Blond', 'Brunet', 'Castaniu', 'Roscat', 'Saten' ))
AND EXISTS (SELECT 1
FROM val_atribute va
WHERE va.user_id = u.id
AND va.attr_id = 109
AND attr_value IN ( 'Calvitie', 'Lung', 'Mediu', 'Scurt', 'Zero' ))
AND activ = 1
Now query takes about 0.0015 seconds to run.
Subqueries in MySQL are problematic -- and the select distinct
makes things worse. You are connecting the subqueries using and
. I would suggest constructing the same logic using exists
instead.
So:
select u.*
from users u
where exists (select 1
from val_atribute va
where va.user_id = u.user_id and
va.attr_id = 45 and
va.attr_value in ( 'Actor', 'Actor Amator' )
) and
exists (select 1
from val_atribute va
where va.user_id = u.user_id and
va.attr_id = 46 and
Floor(Datediff(Curdate(), va.attr_value) / 365) >= 20) Q2
) and
. . .
This version of the query can take advantage of an index on val_attribute(user_id, attr_id, attr_value)
. It should be much faster and have better scalability.