Search code examples
sqlsql-serversql-server-2014

SQL Server filtering out duplications and more


I need help with SQL Server 2014 (v12.0) filtering out duplicates and this might be too big of an ask, but any help would be appreciated. After importing a table of data, we have some duplicates that we would like to manage.

Example of the data is here:

eCRFNo NotesURL Assignee
101 aaa.com Person 1
101 aaa.com Person 2
101 aaa.com Person 3
101 bbb.com Person 4

The goal is to only have 1 unique Control Number (eCRFNo) in the table. If there are duplicates of eCRFNo and NotesURL, then the first one (ordered by eCRFNo, NotesURL, Assignee) should be picked as the final record and the Assignee's of the other matches be placed in a new column called Watcher (concatenated).

If there is a duplicate eCRFNo with a NotesURL that isn't a duplicate, then only the first (ordered by eCRFNo, NotesURL, Assignee) should be kept and "Replication" be added to a new column called Label.

The results of the example should look like the table below:

eCRFNo NotesURL Assignee Watcher Label
101 aaa.com Person 1 Person 2, Person 3 Replication

It is OK that we lost bbb.com in the data and that we lost Person 4.

I will post the tables if this will post at all. I am having trouble posting with tables and code. I need help with my help.

Here is the section of code that is close to what we want:

WITH cte AS 
(
    SELECT 
        Matched.*,
        ROW_NUMBER() OVER (PARTITION BY NotesURL ORDER BY NotesURL) routing_dup,
        ROW_NUMBER() OVER (PARTITION BY eCRFNo ORDER BY eCRFNo) crf_dup
    FROM 
        Matched
)
UPDATE m
SET m.Watcher = a.Watcher, 
    m.Label = a.Label, 
    m.RouteDup = a.routing_dup, 
    m.CRFDup = a.crf_dup
FROM
    (SELECT 
         cte1.UserID, cte1.NotesURL, cte1.eCRFNo, cte1.UID, 
         cte1.routing_dup, cte1.crf_dup, 
         STUFF((SELECT ';' + CAST(cte2.UserID AS varchar(max))
                FROM cte AS cte2
                WHERE cte1.NotesURL = cte2.NotesURL 
                  AND routing_dup > 1
                FOR XML PATH ('')), 1, 1, '') AS Watcher, 
         STUFF((SELECT 'Replication'
                FROM cte AS cte2
                WHERE cte1.eCRFNo = cte2.eCRFNo 
                  AND cte1.NotesURL <> cte2.NotesURL 
                  AND cte2.CRFDup = 2 
                  AND cte1.CRFDup = 1
                FOR XML PATH ('')), 1, 0, '') AS Label
     FROM 
         cte AS cte1) AS a
JOIN
    matched m ON m.UID = a.UID

Solution

  • Since you're using MS SQL v12.0 (SQL Server 2014), you'll need to use SQL Server's XML functionality (vs. using STRING_AGG in SQL Server >= 2017).

    Use a CTE to get the DENSE_RANK and ROW_NUMBER, then convert your rows to an XML string then use SQL Server's XML functionality.

    WITH CTE AS (SELECT *,
    DENSE_RANK() OVER (PARTITION BY ecRFNo ORDER BY ecRFNo, NotesURL ASC) as DR,
    ROW_NUMBER() OVER (PARTITION BY ecRFNo ORDER BY ecRFNo, NotesURL, Assignee ASC) as RN
    FROM your_table)
    SELECT *,
      CASE WHEN a.Watcher IS NOT NULL THEN 'Replication' ELSE 'Unique' END AS Label
      FROM (SELECT ecRFNo, NotesURL, Assignee,
      stuff( (select DISTINCT ', ' + cast(t.Assignee as varchar(max))
                   from CTE t
                    WHERE RN > 1 AND t.ecRFNo = a.ecRFNo AND t.NotesURL = a.NotesURL
                   for xml path ('')
                  ), 1, 2, '') AS Watcher
            FROM CTE a WHERE DR = 1 AND RN = 1
            GROUP BY ecRFNo, NotesURL, Assignee) a
    

    Note: If duplicates are not found, the column Watcher will display a NULL value and the column Label will display the value Unique.

    If you want to UPDATE your table and DELETE the rows that are no longer needed, add your new columns then wrap the query above in another CTE with an UDPATE statement:

    ALTER TABLE your_table
    ADD 
    Watcher VARCHAR(250),
    Label VARCHAR(25);
    
    WITH CTE AS (SELECT *,
    DENSE_RANK() OVER (PARTITION BY ecRFNo ORDER BY ecRFNo, NotesURL ASC) as DR,
    ROW_NUMBER() OVER (PARTITION BY ecRFNo ORDER BY ecRFNo, NotesURL, Assignee ASC) as RN
    FROM your_table),
    CTE2 AS 
    (SELECT *,
      CASE WHEN a.Watcher IS NOT NULL THEN 'Replication' ELSE 'Unique' END AS Label
      FROM (SELECT ecRFNo, NotesURL, Assignee,
      stuff( (select DISTINCT ', ' + cast(t.Assignee as varchar(max))
                   from CTE t
                    WHERE RN > 1 AND t.ecRFNo = a.ecRFNo AND t.NotesURL = a.NotesURL
                   for xml path ('')
                  ), 1, 2, '') AS Watcher
            FROM CTE a WHERE DR = 1 AND RN = 1
            GROUP BY ecRFNo, NotesURL, Assignee) a)
    UPDATE your_table
    SET your_table.Watcher = b.Watcher, your_table.Label = b.Label
    FROM CTE2 b
    WHERE your_table.ecRFNo = b.ecRFNo 
      AND your_table.NotesURL = b.NotesURL 
      AND your_table.Assignee = b.Assignee
    

    Finally, DELETE the rows you no longer require by using the Label column; anything not touched by the UPDATE above will remain NULL.

    DELETE FROM your_table WHERE Label IS NULL
    

    Result:

    | ecRFNO | NotesURL | Assignee | Watcher            | Label       |
    |--------|----------|----------|--------------------|-------------|
    | 101    | aaa.com  | Person 1 | Person 2, Person 3 | Replication |
    

    Fiddle here.