Search code examples
sql-serverduplicatesminpartition-by

Select min BeginDate from duplicates


enter image description here

select [BK_Vendedor]
       ,[NIF]
      ,[BeginDate]
      ,[EndDate]
from (
      select 
       [BK_Vendedor]
      ,[NIF]
      ,[BeginDate]
      ,[EndDate],
             count(*) over (partition by [BK_Vendedor]) as dc
      from [test_SA].[dbo].[Dim_Vendedor]
     ) as T
where dc > 1

This table is the result from this query that only picks duplicates values from the BK_Vendedor column from another bigger table. but in this same query i also want only the rows with the minimum BeginDate from each BK_Vendedor. How can i do this?


Solution

  • Add a ROW_NUMBER partitioned on BK_Vendedor and filter on the row number being = 1. When you partition ROW_NUMBER by a field (or multiple fields!) it resets the count to 1 every time the field changes. You effectively group by the field and each group gets its own series of row numbers. We use that to get the first BeginDate in each group.

    Note that if we wanted the last BeginDate we would reverse the ordering so the last was #1 (ROW_NUMBER () over (partition by BK_Vendedor ORDER BY BeginDate DESC) as Ord) and keep the WHERE clause specifying Ord = 1

    with cteSampleData as (
        SELECT * FROM (VALUES (100, 10, '2022-01-10', '2022-01-25') 
            , (100, 10, '2022-01-08', '2022-01-15') 
            , (101, 11, '2022-03-10', '2022-04-15') 
            , (102, 12, '2022-02-10', NULL) 
            , (103, 13, '2022-02-15', NULL) 
            , (103, 13, '2022-03-15', '2022-04-10') 
            , (103, 15, '2022-02-21', '2022-04-01') --EDIT: Add this row
        ) as DimVend(BK_Vendedor,NIF,BeginDate,EndDate)
    ) --The above is just sample data as I don't have your [test_SA].[dbo].[Dim_Vendedor]
    select [BK_Vendedor]
           ,[NIF]
          ,[BeginDate]
          ,[EndDate]
    from (
          select 
           [BK_Vendedor]
          ,[NIF]
          ,[BeginDate]
          ,[EndDate],
                 count(*) over (partition by [BK_Vendedor]) as dc
                 --Add the line below
                 , ROW_NUMBER () over (partition by BK_Vendedor ORDER BY BeginDate) as Ord
          from cteSampleData
         ) as T
    where dc > 1
        AND Ord = 1 --And add this line too
    

    Input looks like this:

    BK_Vendedor NIF BeginDate EndDate
    100 10 2022-01-10 2022-01-25
    100 10 2022-01-08 2022-01-15
    101 11 2022-03-10 2022-04-15
    102 12 2022-02-10 NULL
    103 13 2022-02-15 NULL
    103 13 2022-03-15 2022-04-10

    Input (in original order) with the addition of dc and Ord:

    BK_Vendedor NIF BeginDate EndDate dc Ord
    100 10 2022-01-10 2022-01-25 2 2
    100 10 2022-01-08 2022-01-15 2 1
    101 11 2022-03-10 2022-04-15 1 1
    102 12 2022-02-10 NULL 1 1
    103 13 2022-02-15 NULL 3 1
    103 13 2022-03-15 2022-04-10 3 3
    103 15 2022-02-21 2022-04-01 3 2

    and Output looks like this:

    BK_Vendedor NIF BeginDate EndDate
    100 10 2022-01-08 2022-01-15
    103 13 2022-02-15 NULL