UPDATE - FINAL SOLUTION TO THIS ISSUE
Our dynamic system allows for a BOOLEAN interpolated match of things like Name, Job Title, Phone Number, etc. So we can say:
Name("ted" OR "mike" OR "david" AND "martin") AND Title("developer" AND "senior" NOT "CTO) AND City("san diego")
The way this is accomplished is to follow the below grouping example, which is dynamically created. It's pretty straightforward, however the use of HAVING COUNT is necessary to properly define the AND indexes.
Also not in this example access_indexes is a list of ID indexes an account has access to, so if the "search" returns a person the account can't access, it won't show up.
Thanks to everyone for your help, especially @BillKarwin!
WITH filter0 AS
(
SELECT pm.ID FROM person_main pm
WHERE MATCH(pm.name_full) AGAINST ('(ted)' IN BOOLEAN MODE)
),
filter1 AS
(
SELECT ram.object_ref_id AS ID
FROM ras_assignment_main ram
WHERE ram.object_type_c = 1
AND ram.assignment_type_c = 1
AND ram.assignment_ref_id IN (2)
),
persongroup0_and AS
(
SELECT pg0_a.ID FROM
(
SELECT ID FROM filter0
) pg0_a
GROUP BY pg0_a.ID
HAVING COUNT(pg0_a.ID) = 1
),
persongroup0 AS
(
SELECT pm.ID
FROM person_main pm
JOIN persongroup0_and pg0_and ON pm.ID = pg0_and.ID
),
persongroup1_and AS
(
SELECT pg1_a.ID FROM
(
SELECT ID FROM filter1
) pg1_a
GROUP BY pg1_a.ID
HAVING COUNT(pg1_a.ID) = 1
),
persongroup1 AS
(
SELECT pm.ID
FROM person_main pm
JOIN persongroup1_and pg1_and ON pm.ID = pg1_and.ID
),
person_all_and AS
(
SELECT paa.ID FROM
(
SELECT ID FROM persongroup0
UNION ALL (SELECT ID FROM persongroup1)
) paa
GROUP BY paa.ID
HAVING COUNT(paa.ID) = 2
),
person_all AS
(
SELECT pm.ID
FROM person_main pm
JOIN person_all_and pa_and ON pm.ID = pa_and.ID
),
person_access AS
(
SELECT pa.ID
FROM person_all pa
LEFT JOIN access_indexes ai ON pa.ID = ai.ID
)
SELECT (JSON_ARRAYAGG(pm.ID))
FROM
(
SELECT person_sort.ID
FROM
(
SELECT pa.ID
FROM person_access pa
GROUP BY pa.ID
) person_sort
) pm;
Our front-end system has the ability to define dynamic SQL queries using AND/OR/NOT from multiple tables, and the core system works fine - but it's slows down to being unusable due to the compounded scanning of IN. For the life of me, I can't figure out how to have this level of dynamic functionality without using IN. Below is the code that works perfectly fine (the filter matching is ultra fast), but the compounding of the IN scan takes > 60 seconds because it's 50,000+ records for some of the filter returns.
WITH filter0 AS
(
SELECT pm.ID FROM person_main pm
WHERE MATCH(pm.name_full) AGAINST ('mike meyers' IN BOOLEAN MODE)
),
filter1 AS
(
SELECT phw.person_main_ref_id AS ID
FROM person_history_work phw
WHERE MATCH(phw.work_title) AGAINST('developer' IN BOOLEAN MODE)
),
filter2 AS
(
SELECT pa.person_main_ref_id AS ID
FROM person_address pa
WHERE pa.global_address_details_ref_id IN
(
SELECT gad.ID
FROM global_address_details gad
WHERE gad.address_city LIKE '%seattle%'
)
),
all_indexes AS
(
SELECT ID FROM filter0
UNION (SELECT ID FROM filter1)
UNION (SELECT ID FROM filter2)
),
person_filter AS
(
SELECT ai.ID
FROM all_indexes ai
WHERE
(
ai.ID IN (SELECT ID FROM filter0)
AND ai.ID NOT IN (SELECT ID FROM filter1)
OR ai.ID IN (SELECT ID FROM filter2)
)
)
SELECT (JSON_ARRAYAGG(pf.ID)) FROM person_filter pf;
Filter 0 has 461 records, Filter 1 has 48480 and Filter 2 has 750.
The key issue is with the WHERE statement; because the front-end can say AND/OR and NOT on any "joined" query.
So if I change it to:
ai.ID IN (SELECT ID FROM filter0)
AND ai.ID IN (SELECT ID FROM filter1)
AND ai.ID IN (SELECT ID FROM filter2)
The query takes more than 60 seconds. Because it's scanning 461 * 48480 * 750 = 16,761,960,00. UGH.
Of course I could hardcode around this if it was a static stored procedure or call, but it's a dynamic interpolative system that takes the settings defined by the user, so the user can define the above.
As you can see what I do is create a list of all indexes involved, then select them based on the AND/OR/NOT values as defined by the front-end web tool.
Obviously IN won't work for this; the question is what other techniques could I use that don't involve the use of IN that would allow the same level of flexibility with AND/OR/NOT?
Update for @BillKarwin in Comments
So the below code works well for executing an AND, NOT and OR:
SELECT pm.ID
FROM person_main pm
JOIN filter0 f0 ON f0.ID = pm.ID -- AND
LEFT JOIN filter1 f1 ON pm.ID = f1.ID WHERE f1.ID IS NULL -- NOT
UNION (SELECT ID FROM filter2) -- OR
I believe I can make this work with our system; I just need to store the different types (AND/NOT/OR) and execute them in process; let me do some updates and I'll get back to you.
As discussed in the comments above:
Logically, you can replace a lot of your subqueries with JOIN when they are AND terms of your expression, or UNION when they are OR terms of your expression. Also learn about exclusion joins.
But that doesn't necessarily mean that the queries will run faster, unless you have created indexes to support the join conditions and the user-defined conditions.
But which indexes should you create?
Ultimately, it's not possible to optimize all dynamic queries that users come up with. You may be able to run their queries (as you are already doing), but they won't be efficient.
It's kind of a losing game to allow users to specify arbitrary conditions. It's better to give them a fixed set of choices, which are types of queries that you have taken the time to optimize. Then allow them to run a "user-specified" query, but label it clearly that it is not optimized and it will likely take a long time.