The goal is to query data with logic based on a pivot table. I am wanting to support 6 logic gates, [AND, OR, XOR, NAND, NOR, XNOR]
, so that when a user creates a custom filter they can provide the logic gate and the tags.id
values to use in the logic.
Example data:
+----------+
| data |
|----+-----+
| id | ... |
+----+-----+
| s | ... |
| t | ... |
| u | ... |
| v | ... |
| w | ... |
| x | ... |
| y | ... |
| z | ... |
+----+-----+
+---------+
| pivot |
|----+----|
| c1 | c2 |
+----+----+
| t | a |
| t | b |
| t | c |
| u | a |
| u | b |
| v | b |
| v | c |
| w | a |
| w | c |
| x | a |
| y | b |
| z | c |
+----+----+
+----------+
| tags |
+----+-----+
| id | ... |
+----+-----+
| a | ... |
| b | ... |
| c | ... |
| d | ... |
+----+-----+
Expected outputs:
AND(a, b, c)
= [t]
OR(a, b, c)
= [t, u, v, w, x, y, z]
XOR(a, b, c)
= [u, v, w, x, y, z]
NAND(a, b, c)
= [s, u, v, w, x, y, z]
NOR(a, b, c)
= [s]
XNOR(a, b, c)
= [s, t]
The query representations:
AND(a, b, c)
= SELECT * FROM data WHERE id IN (SELECT c1 FROM pivot WHERE c2='a') AND id IN (SELECT c1 FROM pivot WHERE c2='b') AND id IN (SELECT c1 FROM pivot WHERE c2='c')
OR(a, b, c)
= SELECT * FROM data WHERE id IN (SELECT c1 FROM pivot WHERE c2 IN ('a', 'b', 'c'))
XOR(a, b, c)
= SELECT * FROM data WHERE id IN (SELECT c1 FROM pivot WHERE c2 IN ('a', 'b', 'c')) AND !(id IN (SELECT c1 FROM pivot WHERE c2='a') AND id IN (SELECT c1 FROM pivot WHERE c2='b') AND id IN (SELECT c1 FROM pivot WHERE c2='c'))
NAND(a, b, c)
= SELECT * FROM data WHERE !(id IN (SELECT c1 FROM pivot WHERE c2='a') AND id IN (SELECT c1 FROM pivot WHERE c2='b') AND id IN (SELECT c1 FROM pivot WHERE c2='c'))
NOR(a, b, c)
= SELECT * FROM data WHERE id NOT IN (SELECT c1 FROM pivot WHERE c2 IN ('a', 'b', 'c'))
XNOR(a, b, c)
= SELECT * FROM data WHERE id NOT IN (SELECT c1 FROM pivot WHERE c2 IN ('a', 'b', 'c')) OR (id IN (SELECT c1 FROM pivot WHERE c2='a') AND id IN (SELECT c1 FROM pivot WHERE c2='b') AND id IN (SELECT c1 FROM pivot WHERE c2='c'))
data is currently ~30K rows and tags is ~1.5K rows. I am currently populating this pivot table with 100K rows to do some testing with the above queries. I feel like the multiple IN()
statements will make this slow.
Can these queries be fine tuned with Views, Joins, or other MySQL operations?
Also if you have suggestions for a better way to structure this, I am all ears. I have previously tried to use JSON fields in data to avoid using a pivot table, but that turned out to be extremely slow.
EDIT: While the OR(...)
and NOR(...)
queries work really well (~80ms), the AND(...)
query performs poorly (~1300ms). Looking at the EXPLAIN
and trying to follow MySQL subquery opimization suggestions to generate a better single subquery using DISTINCT ... INNER JOIN
actually made matters worse.
What I did stumble upon through testing is that I can do a faster intersection of multiple lists of ids in Node with lodash's _.intersection(...)
than I can achieve in pure MySQL.
So instead of using subqueries to form the AND(...)
logic, I am able to pull the separate sub queries from MySQL and then in the API itself, use lodash to perform intersection and then generate a single list to use in the final IN(...)
statement to do final filtering.
OR(a, b, c) = [t, u, v, w, x, y, z]
SELECT GROUP_CONCAT(DISTINCT c1)
FROM pivot
WHERE p2 IN ('a', 'b', 'c');
The rest are messier, so let me ask how many distinct c1's and c2's can exist. If it is no more than 64, you can do Boolean arithmetic on bits in a BIGINT UNSIGNED
.
See also the datatypes ENUM
and SET
.
If you have MySQL 8.0, Boolean operators work on BLOB
, thereby going far past 64.
VIEWs
are syntactic sugar, not a performance tool.
IN(SELECT ...)
is often very inefficient; try to avoid it by either using EXISTS( SELECT 1 ... )
or JOIN
(or LEFT JOIN
).
AND(a, b, c) = [t]
is can be achieved something like this:
SELECT GROUP_CONCAT(DISTINCT c1)
FROM (
SELECT c1
FROM pivot
WHERE c2 IN ('a', 'b', 'c')
GROUP BY c1
HAVING COUNT(*) = 3 -- the number of items in a,b,c
) AS x ;
Note:
c2 IN ('a', 'b', 'c')
can be written thus:
FIND_IN_SET(c2, 'a,b,c')
which might it easier to turn your operators into Stored Routines.