Search code examples
sqlentity-attribute-value

Find similar entities in EAV model


I need a query that returns "similar" products in my EAV table that:

1) share at least one similar attribute

2) have no attributes different to the product

eg

ProductID Attribute Value
1         Prop1      1
1         Prop2      2
2         Prop1      1
3         Prop1      1
3         Prop2      3

Say in this example, searching for products similar to product id 1 (Prop1:1 and Prop2:2). Product 2 would be returned, because Prop1 is 1, but product 3 is not ok, because Prop2 is different.etc.

There is a variable number of attributes for each product, so joining the table for each attribute is not an option. At the moment I'm concatenating the list of props to build a dynamic SQL "where", but I can't find a good (fast?) SQL statement that will do this.

Maybe I've spent too much time focusing on this problem, but I can't shake the feeling that I'm missing an obvious way to do this...


Solution

  • When faced with this sort of problem, I use TDQD — Test-Driven Query Design.

    Please note that it helps everyone if you give your table a name!

    Pass 1

    List of products with one or more attributes identical to product 1

    SELECT a.ProductID, COUNT(*) AS matches
      FROM EAV_Table AS a
      JOIN EAV_Table AS b
        ON a.Attribute = b.Attribute AND a.value = b.value
     WHERE a.ProductID != 1
       AND b.ProductID  = 1
     GROUP BY a.ProductID
    

    This clearly won't list any products where the count is 0, which is fine.

    List of products with one or more attributes mismatching product 1

    SELECT c.ProductID, COUNT(*) AS matches
      FROM EAV_Table AS c
      JOIN EAV_Table AS d
        ON c.Attribute = d.Attribute AND c.value != d.value
     WHERE c.ProductID != 1
       AND d.ProductID  = 1
     GROUP BY c.ProductID
    

    This also won't list products where the count is 0, which is more of a nuisance.

    Result — Pass 1

    We need all the products from the first query where the product is not listed in the second query. That can be expressed by a NOT EXISTS and a correlated sub-query:

    SELECT a.ProductID, COUNT(*) AS matches
      FROM EAV_Table AS a
      JOIN EAV_Table AS b
        ON a.Attribute = b.Attribute AND a.value = b.value
     WHERE a.ProductID != 1
       AND b.ProductID  = 1
       AND NOT EXISTS
           (SELECT c.ProductID
              FROM EAV_Table AS c
              JOIN EAV_Table AS d
                ON c.Attribute = d.Attribute AND c.value != d.value
             WHERE c.ProductID != 1
               AND d.ProductID  = 1
               AND c.ProductID = a.ProductID
           )
     GROUP BY a.ProductID
    

    That's pretty ugly. It works, but it is ugly.

    Test Data

    CREATE TABLE eav_table
    (
        productid INTEGER NOT NULL,
        attribute CHAR(5) NOT NULL,
        value INTEGER NOT NULL,
        PRIMARY KEY(productid, attribute, value)
    );
    
    INSERT INTO eav_table VALUES(1, "Prop1", 1);
    INSERT INTO eav_table VALUES(1, "Prop2", 2);
    INSERT INTO eav_table VALUES(2, "Prop1", 1);
    INSERT INTO eav_table VALUES(3, "Prop1", 1);
    INSERT INTO eav_table VALUES(3, "Prop2", 3);
    INSERT INTO eav_table VALUES(4, "Prop1", 1);
    INSERT INTO eav_table VALUES(4, "Prop3", 1);
    

    Q1 Result

    2    1
    3    1
    4    1
    

    Q2 Result

    3    1
    

    Q3 Result

    2    1
    4    1
    

    The ones are the counts I generated; a more polished rendition would remove them.


    Pass 2

    If it can be managed, a nicer final query would join a table that lists all the product IDs which have at least one matching attribute/value pair in common with product ID 1 with a table that lists all the product IDs which have zero disagreements with product ID 1.

    List of products with one or more attributes identical to product 1

    The first query is the same as the first query in Pass 1, except we'll drop the count in the result set.

    SELECT a.ProductID
      FROM EAV_Table AS a
      JOIN EAV_Table AS b
        ON a.Attribute = b.Attribute AND a.value = b.value
     WHERE a.ProductID != 1
       AND b.ProductID  = 1
     GROUP BY a.ProductID
    

    Either the GROUP BY clause or a DISTINCT in the select list is necessary, in general (though the sample data doesn't formally require it).

    List of products with zero attributes that mismatch product 1

    We will exploit the fact that COUNT(column) only counts non-null values, and use a LEFT OUTER JOIN.

    SELECT c.ProductID
      FROM      EAV_Table AS c
      LEFT JOIN EAV_Table AS d
        ON c.Attribute = d.Attribute
       AND c.Value != d.Value
       AND c.ProductID != 1
       AND d.ProductID  = 1
     GROUP BY c.ProductID
    HAVING COUNT(d.Value) == 0;
    

    Note that the WHERE clause has been merged into the ON clause; this is actually rather important.

    Result — Pass 2

    We build the two queries above as sub-queries that are joined to generate the final result:

    SELECT f.ProductID
      FROM (SELECT a.ProductID
              FROM EAV_Table AS a
              JOIN EAV_Table AS b
                ON a.Attribute = b.Attribute AND a.value = b.value
             WHERE a.ProductID != 1
               AND b.ProductID  = 1
             GROUP BY a.ProductID
           ) AS e
      JOIN (SELECT c.ProductID
              FROM      EAV_Table AS c
              LEFT JOIN EAV_Table AS d
                ON c.Attribute = d.Attribute
               AND c.Value != d.Value
               AND c.ProductID != 1
               AND d.ProductID  = 1
             GROUP BY c.ProductID
            HAVING COUNT(D.Value) = 0
           ) AS f
        ON e.ProductID = f.ProductID
    

    This produces the answer 2 and 4 on the sample data.

    Note that one part of this exercise is learning not to be satisfied with the first answer you develop. Note that it would be good to benchmark the solutions on full size data sets instead of test data sets with only 7 rows in the table.