Search code examples
mysqlsql-serverdata-consistency

MySQL - sql server: consistency check


I'm trying to check the results of a data load between two databases. Unfortunately, I only have access to one database (MySQL) directly, the company managing MSSQL can expose it to us via an API.

What I would like to do is check the consistency of certain columns across rowsets. Originally, I had hoped to be able to run a CRC or hash check against the columns, but there doesn't seem to be a compatible way of doing this.

For example, we can run CRC32 against a column in MySQL, but there isn't a reliable way of doing the same on MSSQL. Alternatively, there's CHECKSUM_AGG on MSSQL, but no alternative on MySQL.

The end result is that I would like to do a binary search if the checksums differ to identify the rows that require changing.

There is currently no bulk load interface, and SSIS is not available (the MSSQL servers are not part of my company).


Solution

  • I thought I'd come back to this and describe the solution we ended up implementing. This was a major pain in the neck!

    Firstly, because of the fixed versions of MySQL on our server and MSSQL on the remote server, there were no common encoding methods. The MSSQL API returned data in UTF-16LE, the MySQL database had Unicode data stored in Latin-1 tables sigh

    Firstly, we concatenated the fields that we were comparing, then we MD5'd the result. In order to get the MySQL result to match the output of the MSSQL HASHBYTES function, we had to do this:

    SELECT ABS(CONV(CONCAT(
        IF(MID(MD5(CONC), -8 , 1) >= "8", "FFFFFFFF", ""),
        RIGHT(MD5(CONC), 8)
    ), 16, -10 )) AS CALC
    

    where CONC is the result of a subselect concatenating the fields we are interested in.

    On the MSSQL server, we had to do the following query:

    SELECT ABS(CONVERT(INT,HASHBYTES('MD5',
        CONVERT(NVARCHAR(4000), FIELD1 ) + 
        CONVERT(NVARCHAR(4000), FIELD2 ) + ...
    

    Then, we took the sum of the MD5 across the entire range, modulo three large-ish primes(311,313,317), as per Chinese Remainder Theorem

    This gave us three numbers for the range we were checking. We could be reasonably certain that if all three numbers matched for a given range on each server, then the data was consistent.

    I'll spare you the details of the munging we had to do to get Unicode in Latin-1 transliterated to UTF-16LE