Search code examples
phpmysqldifflarge-data

Efficiently get diff of large data set?


I need to be able to diff the results of two queries, showing the rows that are in the "old" set but aren't in the "new"... and then showing the rows that are in the "new" set but not the old.

Right now, i'm pulling the results into an array, and then doing an array_diff(). But, i'm hitting some resource and timing issues, as the sets are close to 1 million rows each.

The schema is the same in both result sets (barring the setId number and the table's autoincrement number), so i assume there's a good way to do it directly in MySQL... but im not finding how.

Example Table Schema:
rowId,setId,userId,name

Example Data:
    1,1,user1,John
    2,1,user2,Sally
    3,1,user3,Tom
    4,2,user1,John
    5,2,user2,Thomas
    6,2,user4,Frank

What i'm needing to do, is figure out the adds/deletes between setId 1 and setId 2.

So, the result of the diff should (for the example) show:

Rows that are in both setId1 and setId2
    1,1,user1,John

Rows that are in setId 1 but not in setId2
    2,1,user2,Sally
    3,1,user3,Tom

Rows that are in setId 2 but not in setId1
    5,2,user2,Thomas
    6,2,user4,Frank

I think that's all the details. And i think i got the example correct. Any help would be appreciated. Solutions in MySQL or PHP are fine by me.


Solution

  • What we ended up doing, was adding a checksum column to the necessary tables being diffed. That way, instead of having to select multiple columns for comparison, the diff could be done against a single column (the checksum value).

    The checksum value was a simple md5 hash of a serialized array that contained the columns to be diffed. So... it was like this in PHP:

    $checksumString = serialize($arrayOfColumnValues);
    $checksumValue = md5($checksumString);
    

    That $checksumValue would then be inserted/updated into the tables, and then we can more easily do the joins/unions etc on a single column to find the differences. It ended up looking something like this:

    SELECT  i.id, i.checksumvalue
    FROM    SAMPLE_TABLE_I i
    WHERE   i.checksumvalue not in(select checksumvalue from SAMPLE_TABLE_II)
    UNION ALL
    SELECT  ii.id, ii.checksumvalue
    FROM    SAMPLE_TABLE_II ii
    WHERE   ii.checksumvalue not in(select checksumvalue from SAMPLE_TABLE_I);
    

    This runs fast enough for my purposes, at least for now :-)