I need a query that returns "similar" products in my EAV table that:
1) share at least one similar attribute
2) have no attributes different to the product
eg
ProductID Attribute Value
1 Prop1 1
1 Prop2 2
2 Prop1 1
3 Prop1 1
3 Prop2 3
Say in this example, searching for products similar to product id 1 (Prop1:1 and Prop2:2). Product 2 would be returned, because Prop1 is 1, but product 3 is not ok, because Prop2 is different.etc.
There is a variable number of attributes for each product, so joining the table for each attribute is not an option. At the moment I'm concatenating the list of props to build a dynamic SQL "where", but I can't find a good (fast?) SQL statement that will do this.
Maybe I've spent too much time focusing on this problem, but I can't shake the feeling that I'm missing an obvious way to do this...
When faced with this sort of problem, I use TDQD — Test-Driven Query Design.
Please note that it helps everyone if you give your table a name!
SELECT a.ProductID, COUNT(*) AS matches
FROM EAV_Table AS a
JOIN EAV_Table AS b
ON a.Attribute = b.Attribute AND a.value = b.value
WHERE a.ProductID != 1
AND b.ProductID = 1
GROUP BY a.ProductID
This clearly won't list any products where the count is 0, which is fine.
SELECT c.ProductID, COUNT(*) AS matches
FROM EAV_Table AS c
JOIN EAV_Table AS d
ON c.Attribute = d.Attribute AND c.value != d.value
WHERE c.ProductID != 1
AND d.ProductID = 1
GROUP BY c.ProductID
This also won't list products where the count is 0, which is more of a nuisance.
We need all the products from the first query where the product is not listed in the second query. That can be expressed by a NOT EXISTS and a correlated sub-query:
SELECT a.ProductID, COUNT(*) AS matches
FROM EAV_Table AS a
JOIN EAV_Table AS b
ON a.Attribute = b.Attribute AND a.value = b.value
WHERE a.ProductID != 1
AND b.ProductID = 1
AND NOT EXISTS
(SELECT c.ProductID
FROM EAV_Table AS c
JOIN EAV_Table AS d
ON c.Attribute = d.Attribute AND c.value != d.value
WHERE c.ProductID != 1
AND d.ProductID = 1
AND c.ProductID = a.ProductID
)
GROUP BY a.ProductID
That's pretty ugly. It works, but it is ugly.
CREATE TABLE eav_table
(
productid INTEGER NOT NULL,
attribute CHAR(5) NOT NULL,
value INTEGER NOT NULL,
PRIMARY KEY(productid, attribute, value)
);
INSERT INTO eav_table VALUES(1, "Prop1", 1);
INSERT INTO eav_table VALUES(1, "Prop2", 2);
INSERT INTO eav_table VALUES(2, "Prop1", 1);
INSERT INTO eav_table VALUES(3, "Prop1", 1);
INSERT INTO eav_table VALUES(3, "Prop2", 3);
INSERT INTO eav_table VALUES(4, "Prop1", 1);
INSERT INTO eav_table VALUES(4, "Prop3", 1);
2 1
3 1
4 1
3 1
2 1
4 1
The ones are the counts I generated; a more polished rendition would remove them.
If it can be managed, a nicer final query would join a table that lists all the product IDs which have at least one matching attribute/value pair in common with product ID 1 with a table that lists all the product IDs which have zero disagreements with product ID 1.
The first query is the same as the first query in Pass 1, except we'll drop the count in the result set.
SELECT a.ProductID
FROM EAV_Table AS a
JOIN EAV_Table AS b
ON a.Attribute = b.Attribute AND a.value = b.value
WHERE a.ProductID != 1
AND b.ProductID = 1
GROUP BY a.ProductID
Either the GROUP BY clause or a DISTINCT in the select list is necessary, in general (though the sample data doesn't formally require it).
We will exploit the fact that COUNT(column)
only counts non-null values, and use a LEFT OUTER JOIN.
SELECT c.ProductID
FROM EAV_Table AS c
LEFT JOIN EAV_Table AS d
ON c.Attribute = d.Attribute
AND c.Value != d.Value
AND c.ProductID != 1
AND d.ProductID = 1
GROUP BY c.ProductID
HAVING COUNT(d.Value) == 0;
Note that the WHERE clause has been merged into the ON clause; this is actually rather important.
We build the two queries above as sub-queries that are joined to generate the final result:
SELECT f.ProductID
FROM (SELECT a.ProductID
FROM EAV_Table AS a
JOIN EAV_Table AS b
ON a.Attribute = b.Attribute AND a.value = b.value
WHERE a.ProductID != 1
AND b.ProductID = 1
GROUP BY a.ProductID
) AS e
JOIN (SELECT c.ProductID
FROM EAV_Table AS c
LEFT JOIN EAV_Table AS d
ON c.Attribute = d.Attribute
AND c.Value != d.Value
AND c.ProductID != 1
AND d.ProductID = 1
GROUP BY c.ProductID
HAVING COUNT(D.Value) = 0
) AS f
ON e.ProductID = f.ProductID
This produces the answer 2 and 4 on the sample data.
Note that one part of this exercise is learning not to be satisfied with the first answer you develop. Note that it would be good to benchmark the solutions on full size data sets instead of test data sets with only 7 rows in the table.