Search code examples
sqlconcatenationsubsetbitwise-operatorsexists

Remove subset rows of other rows in SQL


I've this table below that includes ID, and five indicator columns: x1, ..., x5:

enter image description here

I need to remove duplicates based on this logic:

For each ID, we look at the values of x1, ..., x5, and we remove the ones that are subset of other row. For example, for ID=1, row #3 is a subset of row #2, so we remove row #3. Also, row #4 is NOT a subset of row #2, so we keep it.

Here is the expected final view of the table:

enter image description here


Solution

  • First concatenate all values of the 5 columns so that you get a binary string like '100101' which can be converted to a base 10 number, say aliased value, with the function CONV().

    Assuming there are no duplicate rows in the table as you mention in your comment, a row #X should be considered a subset of another row #Y if the result of the bitwise OR between the values of the 2 rows returns the value of #Y:

    SELECT t1.*
    FROM tablename t1
    WHERE NOT EXISTS (
      SELECT *
      FROM tablename t2
      WHERE t2.ID = t1.ID 
        AND (t1.x1, t1.x2, t1.x3, t1.x4, t1.x5) <>
            (t2.x1, t2.x2, t2.x3, t2.x4, t2.x5)
        AND CONV(CONCAT(t1.x1, t1.x2, t1.x3, t1.x4, t1.x5), 2, 10) |
            CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10) =
            CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10)
    );
    

    Or, for MySql 8.0+:

    WITH cte AS (
      SELECT *, CONV(CONCAT(x1, x2, x3, x4, x5), 2, 10) value
      FROM tablename
    )
    SELECT t1.ID, t1.x1, t1.x2, t1.x3, t1.x4, t1.x5
    FROM cte t1
    WHERE NOT EXISTS (
      SELECT *
      FROM cte t2
      WHERE t2.ID = t1.ID 
        AND t2.value <> t1.value
        AND t1.value | t2.value = t2.value
    );
    

    If you want to delete the subset rows, use a self join of the table like this:

    DELETE t1
    FROM tablename t1 INNER JOIN tablename t2
    ON t2.ID = t1.ID
    AND (t1.x1, t1.x2, t1.x3, t1.x4, t1.x5) <>
        (t2.x1, t2.x2, t2.x3, t2.x4, t2.x5)
    AND CONV(CONCAT(t1.x1, t1.x2, t1.x3, t1.x4, t1.x5), 2, 10) |
        CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10) =
        CONV(CONCAT(t2.x1, t2.x2, t2.x3, t2.x4, t2.x5), 2, 10);
    

    See the demo.