Search code examples
sqldatabasejdbcdiffsystem-design

Is comparing two tables faster by importing them into a sql database or by using jdbc?


Background

I need to compare two tables in two different datacenters to make sure they're the same. The tables can be hundreds of millions, even a billion lines.

An example of this is having a production data pipeline and a development data pipeline. I need to verify that the tables at the end of each pipeline are the same, however, they're located in different datacenters.

The tables are the same if all the values and datatypes for each row and column match. There are primary keys for each table.

Here's an example input and output:

Input

table1:
Name | Age |
Alice| 25.0|
Bob  | 49  |
Jim  | 45  |
Cal  | 52  |

table2:
Name | Age |
Bob  | 49  |
Cal  | 42  |
Alice| 25  |

Output:

table1 missing rows (empty):
Name | Age |
     |     |

table2 missing rows:
Name | Age |
Jim  | 45  |

mismatching rows:
Name | Age | table |
Alice| 25.0| table1|
Alice| 25  | table2|
Cal  | 52  | table1|
Cal  | 42  | table2|

Note: The output doesn't need to be exactly like the above format, but it does need to contain the same information.

Question

Is it faster to import these tables into a new, common SQL environment, then use SQL to produce my desired output?

OR

Is it faster to use something like JDBC, retrieve all rows for each table, sort each table, then compare them line by line to produce my desired output?

Edits:

The above solutions would be executed at a datacenter that's hosting one of the tables. In the first solution, the only purpose for creating a new database would be to compare these tables using SQL, there are no other uses.


Solution

  • You should definitively start with the database option. Especially if the databases are connected with a database link you can easy set up the transfer of the data.

    Such comparison often leads to a full outer join of the two sources and the experience tell us that DIY joins are notorically less performant that the native database implementation (you can deploy for example a parallel option).

    Anyway you may try to implement some sofisticated algoritm that can make the compare without the necessity to transfer the whole table.

    An example is based on the Merkle Trees where you first scan both source in their location to recognise which parts are identical (that can be ignored) and transfer and compare only the party with a difference.

    So if you expect the tables are nearly identical and have keys that allows some hierarchy such approach could end better than a brute force full compare.