Tags: sql, database, algorithm, logic, generic-programming

Identifying Differences Efficiently


Every day, we receive huge files from various vendors in different formats (CSV, XML, custom) which we need to upload into a database for further processing.

The problem is that these vendors send the full dump of their data, not just the updates. For some of our applications we need to send only the updates (that is, the changed records). What we do currently is load the data into a staging table and then compare it against the previous data. This is painfully slow because the data set is huge, and we occasionally miss SLAs.

Is there a quicker way to resolve this issue? Any suggestions or help would be greatly appreciated. Our programmers are running out of ideas.


Solution

  • There are a number of patterns for detecting deltas, i.e. changed records, new records, and deleted records, in full dump data sets.

    One of the more efficient ways I've seen is to create hash values of the rows of data you already have, create hashes of the import once it's in the database, then compare the existing hashes to the incoming hashes.

    Primary key match + hash match = Unchanged row

    Primary key match + hash mismatch = Updated row

    Primary key in incoming data but missing from existing data set = New row

    Primary key not in incoming data but in existing data set = Deleted row
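    The four rules above can be sketched in a small in-memory example (the function name and dict-based inputs are illustrative; in practice the same comparison would typically be done with SQL joins between the staging table and the existing table):

    ```python
    def classify_deltas(existing, incoming):
        """Classify rows by comparing primary keys and row hashes.

        existing, incoming: dicts mapping primary key -> row hash.
        Returns a dict of key sets: unchanged, updated, new, deleted.
        """
        existing_keys = set(existing)
        incoming_keys = set(incoming)
        common = existing_keys & incoming_keys
        return {
            # PK match + hash match = unchanged row
            "unchanged": {k for k in common if existing[k] == incoming[k]},
            # PK match + hash mismatch = updated row
            "updated": {k for k in common if existing[k] != incoming[k]},
            # PK only in incoming data = new row
            "new": incoming_keys - existing_keys,
            # PK only in existing data = deleted row
            "deleted": existing_keys - incoming_keys,
        }

    deltas = classify_deltas(
        existing={1: "hashA", 2: "hashB", 3: "hashC"},
        incoming={1: "hashA", 2: "hashX", 4: "hashD"},
    )
    # deltas -> unchanged: {1}, updated: {2}, new: {4}, deleted: {3}
    ```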

    How to compute the hash varies by database product, but all of the major vendors provide some form of built-in hashing function.
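    If you'd rather stay database-agnostic, the row hash can also be computed client-side during import. A minimal sketch using Python's standard `hashlib` (the delimiter and NULL sentinel are arbitrary choices; the important part is that the same row always serializes the same way):

    ```python
    import hashlib

    def row_hash(row):
        """Hash a tuple of column values into a stable hex digest.

        Columns must always be supplied in the same order, and NULLs are
        mapped to a sentinel so that NULL and empty string hash differently.
        """
        parts = ["\\N" if value is None else str(value) for value in row]
        # \x1f (unit separator) is unlikely to appear in real column data
        payload = "\x1f".join(parts).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()
    ```

    The same caveats apply to in-database hashing: concatenation order, NULL handling, and type formatting (dates, decimals) must be identical on both sides, or identical rows will produce different hashes and show up as false updates.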

    The advantage comes from only having to compare a small number of fields (the primary key column(s) and the hash) rather than doing a field-by-field analysis. Even fairly long hashes can be compared quickly.

    It'll require a little rework of your import processing, but the time spent will pay off over and over again in increased processing speed.