Search code examples
sqlsql-servert-sqlgreatest-n-per-groupsql-server-2019

Find multiple duplicate rows in SQL Server table by name and date


I am trying to find duplicate records by name and the date submitted.

For example, I want to get all records with same name of "SampleOne" and same submitted date of "2016-10-14 12:44 39 460" and exclude any data that doesn't match.

I am using the query shown below, but I'm getting other data too. From my screenshot below I need only rows 1- 4 and not show the ones that doesn't have the same name and same submitDate.

SELECT TOP (13) 
    a.DocumentName, a.SubmitDateTime
FROM 
    Document AS a 
INNER JOIN
    (SELECT  
         DocumentName, SubmitDateTime, COUNT(*) AS DocCount
     FROM    
         Document
     GROUP BY 
         DocumentName, SubmitDateTime
     HAVING  
         (COUNT(*) > 1)) AS dt ON a.DocumentName = dt.DocumentName 
                               AND a.SubmitDateTime = dt.SubmitDateTime 
LEFT OUTER JOIN
    Staging b ON a.DocumentId = b.DocumentId
WHERE 
    b.DocumentId IS NULL  
    AND a.SubmitDateTime IS NOT NULL
    AND a.InsertDateTime IS NOT NULL
ORDER BY 
    a.SubmitDateTime 

Sample data

DocumentName SubmitDateTime
SampleOne 2016-10-14 12:44:39.460
SampleOne 2016-10-14 12:44:39.460
SampleTwo 2016-10-14 12:44:39.460
SampleTwo 2016-10-14 12:44:39.460
SampleThree 2016-10-14 12:45:32.813
SampleFive 2016-10-14 12:46:23.423
Doc Name 2016-10-17 15:15:05.523
Doc Name 2 2016-10-17 15:15:32.170
Doc Name 2016-10-17 15:17:32.153
Doc Name 2016-10-17 15:17:53.963
Doc Name 2016-10-17 15:18:15.977
Document Text 2016-10-17 15:19:22.950
Document Text 2016-10-17 15:22:25.463

I have managed to get the results that I want by using the query below and needs suggestion if I am on the right track.

   SELECT    t.*
FROM (
    SELECT
        s.*
      , COUNT(*) OVER (PARTITION BY s.DocumentName, s.SubmitDateTime, s.DocumentType) AS DocCount
    FROM Document s
    LEFT OUTER JOIN
     Staging b ON S.DocumentId = b.DocumentId
     WHERE 
     b.DocumentId IS NULL 
     AND s.SubmitDateTime IS NOT NULL
     AND s.InsertDateTime IS NOT NULL
     AND s.DocumentName IS NOT NULL
    ) t
    
WHERE t.DocCount> 1
  
ORDER BY t.DocumentName, t.SubmitDateTime

Solution

  • Since you didn't provide any clear input or output I am assuming the following:

    The subquery t will select only the necessary columns DocumentName and SubmitDateTime.

    I replaced theLEFT OUTER JOIN with Staging with a LEFT JOIN for simplicity, assuming you don't require columns from the Staging table in the final result. In the COUNT(*) function I included the OVER PARTITION to count duplicates only based on DocumentName and SubmitDateTime.

    SELECT t.DocumentName, t.SubmitDateTime
    FROM (
        SELECT s.DocumentName, s.SubmitDateTime, COUNT(*) OVER (PARTITION BY s.DocumentName, s.SubmitDateTime) AS DocCount
        FROM Document s
        LEFT JOIN Staging b ON s.DocumentId = b.DocumentId
        WHERE b.DocumentId IS NULL
            AND s.SubmitDateTime IS NOT NULL
            AND s.InsertDateTime IS NOT NULL
            AND s.DocumentName IS NOT NULL
    ) t
    WHERE t.DocCount > 1
    ORDER BY t.DocumentName, t.SubmitDateTime;