Search code examples
mysqlsqldeduplication

Inserting millions of records with deduplication SQL


This is a theoretical scenario, and I am more than amateur when it comes to large scale SQL databases...

How would I go about inserting around 2million records into an existing database off 6million records (table1 into table2), whilst at the same time using email de-duplication (some subscribers may already exist in site2, but we don't want to insert those that already exist)?

I understand how to simply get the records from site 1 and add them into site 2, but how would we do this on such a large scale, and not causing data duplication? Any reading sources would be more than helpful for me, as ive found that a struggle.

i.e.: Table 1: site1Subscribers

site1Subscribers(subID, subName, subEmail, subDob, subRegDate, subEmailListNum, subThirdParties)

Table 2: site2Subscribers

site2Subscribers(subID, subName, subEmail, subDob, subRegDate, subEmailListNum, subThirdParties)


Solution

  • I would try something like this:

    insert into site2Subscribers
    select * from site1Subscribers s1
    left outer join site2Subscribers s2
        on s1.subEmail = s2.subEmail
    where s2.subEmail is null;
    

    The left outer join along with the null check will return only those rows from site1Subscribers that have no matching entry in site2Subscribers.