Search code examples
sqlsql-servert-sql

Error trying copy a table into another without duplicates


Basically, I want to copy table DET to table DET_NEW. I know that DET has some rows duplicated so I want to do it inserting duplicated rows just once. As you can see, I want to copy these rows by periods using HDR table.

Edit to add info: DET has no primary keys (that's why it has duplicated rows). DET_NEW has a composite primary key with FK1 and FK2.

PROBLEM:

I know that this is not a new topic. I've tried all the solutions I've found and SQL Server returns an error about primary key constraint violation ("cannot insert duplicate key"). The queries were executed explicitly just with one core.

I saw that MySQL has INSERT IGNORE but I think there is nothing similar in SQL Server.

My solutions are the following ones:

Query 1:

INSERT INTO DET_NEW (FK1, FK2, value)
    SELECT DISTINCT D.FK1, D.FK2, D.value
    FROM HDR AS H, DET AS D
    WHERE H.Date  >= CONVERT( datetime, '2015-01-01 00:00:00', 20 )
      AND H.Date < CONVERT( datetime, '2016-01-01 00:00:00', 20 ) 
      AND H.FK1 = D.FK1

Query 2:

INSERT INTO DET_NEW (FK1, FK2, value)
    SELECT D.FK1, D.FK2, D.value
    FROM 
        (SELECT DISTINCT D.FK1, D.FK2, D.value
         FROM HDR AS H, DET AS D
         WHERE H.Date  >= CONVERT( datetime, '2015-01-01 00:00:00', 20)
           AND H.Date < CONVERT( datetime, '2016-01-01 00:00:00', 20 ) 
           AND H.FK1 = D.FK1) D

Query 3:

INSERT INTO DET_NEW (FK1, FK2, value)
    SELECT D.FK1, D.FK2, D.value
    FROM HDR AS H, DET AS D
    WHERE H.Date  >= CONVERT(datetime, '2018-02-01 00:00:00', 20)
      AND H.Date < CONVERT(datetime, '2018-02-05 00:00:00', 20) 
      AND H.FK1 = D.FK1
    GROUP BY D.FK1, D.FK2

Query 4:

WITH cte AS 
(
    SELECT 
        D.FK1, D.FK2, D.value,
        ROW_NUMBER() OVER (PARTITION BY D.FK1, D.FK2, D.value ORDER BY D.FK1) AS [rn]
    FROM 
        HDR AS H, DET AS D
    WHERE 
        H.Date >= CONVERT(datetime, '2018-02-01 00:00:00', 20)
        AND H.Date < CONVERT(datetime, '2018-02-03 00:00:00', 20) 
        AND H.FK1 = D.FK1
)
INSERT INTO DET_NEW (FK1, FK2, value)
    SELECT cte.FK1, cte.FK2, cte.value
    FROM cte
    WHERE cte.[rn] = 1

ADDITIONAL INFO

I executed the following query: in the external part, the code checks if there is any duplication; on the inside, there is a SELECT DISTINCT, the same as Query 2. The results are more than 1k rows, so I've something incorrect in the queries.

SELECT D.FK1, D.FK2, COUNT(D.FK1) AS count
FROM (
    SELECT DISTINCT D.FK1, D.FK2, D.value
    FROM HDR AS H,
        DET AS D
    WHERE H.Date  >= CONVERT( datetime, '2018-02-01 00:00:00', 20 )
        and H.Date < CONVERT( datetime, '2018-02-03 00:00:00', 20 ) 
        and H.FK1 = D.FK1
    )D
GROUP BY D.FK1, D.FK2
HAVING 
    COUNT(*) > 1

The questions are: why am I inserting duplicates? Is there another efficient way for a large database (billions of rows)?


Solution

  • It's likely that for some combinations of FK1 and FK2, there is more than one different value in DET. The following query should demonstrate that

    SELECT FK1, FK2, MIN(value), MAX(value), COUNT(DISTINCT value)
    FROM DET
    GROUP BY FK1, FK2
    HAVING COUNT(DISTINCT value) > 1
    

    You'll either need to include value in the composite key in DET_NEW or work out how you want to select which value to use for each key, e.g. using a MIN or MAX. The query that @MoinulIslam provided will also help you select a single value for each key. In that query it is just choosing the first one.