Search code examples
sqlsql-serverdata-cleaning

How to merge row based on condition in SQL Server


I have table in a SQL Server database table Department and it has 4 columns:

staffNumber department startDate endDate
100 A 2016-09-22 18:30:00.000 2020-02-06 18:29:00.000
100 B 2020-02-06 18:30:00.000 2022-08-21 18:29:00.000
100 A 2022-08-21 18:30:00.000 2079-12-31 18:29:00.000
101 A 2018-06-03 18:30:00.000 2023-08-20 18:29:00.000
101 B 2023-08-20 18:30:00.000 2079-12-31 18:29:00.000
102 A 2022-12-21 18:30:00.000 2023-12-29 18:29:00.000
102 A 2023-12-29 18:30:00.000 2079-12-31 18:29:00.000
103 A 2016-06-22 18:30:00.000 2018-03-05 18:29:00.000
103 A 2018-03-05 18:30:00.000 2021-03-05 18:29:00.000
103 A 2021-03-05 18:30:00.000 2079-12-31 18:29:00.000
104 A 2016-06-22 18:30:00.000 2021-03-05 18:29:00.000
104 A 2021-03-05 18:30:00.000 2079-12-31 18:29:00.000
105 B 2016-08-22 18:30:00.000 2021-02-07 18:29:00.000
105 B 2021-02-09 18:30:00.000 2079-12-31 18:29:00.000

I want to merge more then rows based on specific conditions:

  • Same staffNumber
  • Same department
  • Consecutive records have zero day difference between the expiryDate of the current row and the effectiveDate of the next row

This is my SQL query I tried, but I am not getting the expected output:

WITH DeptData_CTE AS 
(
    SELECT 
        staffNumber,
        department,
        startDate,
        endDate
        LEAD(startDate) OVER (PARTITION BY staffNumber, department ORDER BY startDate) AS nextstartDate,
        LAG(endDate) OVER (PARTITION BY staffNumber, department ORDER BY startDate) AS prevendDate
    FROM 
        department
),
MergedData_CTE AS 
(
    SELECT 
        staffNumber,
        department,
        MIN(startDate) AS startDate,
        MAX(endDate) AS endDate
    FROM 
        DeptData_CTE
    -- We merge rows where previous expiry date and current effective date are consecutive
    WHERE 
        prevendDate IS NULL 
        OR DATEDIFF(day, prevendDate, startDate) != 0
    GROUP BY 
        staffNumber, 
        department
)
-- Final selection of rows
SELECT 
    staffNumber,
    department,
    startDate,
    endDate
FROM 
    MergedData_CTE
ORDER BY 
    staffNumber, 
    department, 
    startDate;

I am getting expected startDate with merged rows with two consecutive rows, but not getting expected endDate.

My expected output would be:

staffNumber department startDate endDate
100 A 2016-09-22 18:30:00.000 2020-02-06 18:29:00.000
100 B 2020-02-06 18:30:00.000 2022-08-21 18:29:00.000
100 A 2022-08-21 18:30:00.000 2079-12-31 18:29:00.000
101 A 2018-06-03 18:30:00.000 2023-08-20 18:29:00.000
101 B 2023-08-20 18:30:00.000 2079-12-31 18:29:00.000
102 A 2022-12-21 18:30:00.000 2079-12-31 18:29:00.000
103 A 2016-06-22 18:30:00.000 2079-12-31 18:29:00.000
104 A 2016-06-22 18:30:00.000 2079-12-31 18:29:00.000
105 B 2016-08-22 18:30:00.000 2021-02-07 18:29:00.000
105 B 2021-02-09 18:30:00.000 2079-12-31 18:29:00.000

For staff numer 105, the row is not merged because date difference is >= 1 day.

I do not want to use a stored procedure. I just want to use SELECT SQL query


Solution

  • This is gaps and islands task.

    1. We found gaps:
    • previous department<>current department (if this is first row we take default for lag() - current department.
    • days between previous endDate and current startDate >0
    1. Assign number for gaps. This is sum(isGap)over(partition by staffNumber order by startDate) gn.

    2. Then merge islands by group by.

    See example

    with DeptData as(
    select *
      ,case when department<>lag(department,1,department)
                  over(partition by staffNumber order by startDate) then 1
            when datediff(day,lag(endDate,1,startDate)over(partition by staffNumber order by startDate)
                             ,startDate)>0  then 1
          else 0
          end isGap
       from department
    )
    ,gaps as(
      select * 
        ,sum(isGap)over(partition by staffNumber order by startDate) gn
      from DeptData
    )
    select staffNumber,department
      ,min(startDate)startDate,max(endDate)endDate
    from gaps
    group by staffNumber,department,gn
    order by staffNumber,startDate,department
    
    staffNumber department startDate endDate
    100 A 2016-09-22 18:30:00.000 2020-02-06 18:29:00.000
    100 B 2020-02-06 18:30:00.000 2022-08-21 18:29:00.000
    100 A 2022-08-21 18:30:00.000 2079-12-31 18:29:00.000
    101 A 2018-06-03 18:30:00.000 2023-08-20 18:29:00.000
    101 B 2023-08-20 18:30:00.000 2079-12-31 18:29:00.000
    102 A 2022-12-21 18:30:00.000 2079-12-31 18:29:00.000
    103 A 2016-06-22 18:30:00.000 2079-12-31 18:29:00.000
    104 A 2016-06-22 18:30:00.000 2079-12-31 18:29:00.000
    105 B 2016-08-22 18:30:00.000 2021-02-07 18:29:00.000
    105 B 2021-02-09 18:30:00.000 2079-12-31 18:29:00.000

    fiddle

    Before group by we get this result.
    There you see gaps (isGap) and islands (gn) particularily for stuffNumber.

    staffNumber department startDate endDate isGap gn
    100 A 2016-09-22 18:30:00.000 2020-02-06 18:29:00.000 0 0
    100 B 2020-02-06 18:30:00.000 2022-08-21 18:29:00.000 1 1
    100 A 2022-08-21 18:30:00.000 2079-12-31 18:29:00.000 1 2
    101 A 2018-06-03 18:30:00.000 2023-08-20 18:29:00.000 0 0
    101 B 2023-08-20 18:30:00.000 2079-12-31 18:29:00.000 1 1
    102 A 2022-12-21 18:30:00.000 2023-12-29 18:29:00.000 0 0
    102 A 2023-12-29 18:30:00.000 2079-12-31 18:29:00.000 0 0
    103 A 2016-06-22 18:30:00.000 2018-03-05 18:29:00.000 0 0
    103 A 2018-03-05 18:30:00.000 2021-03-05 18:29:00.000 0 0
    103 A 2021-03-05 18:30:00.000 2079-12-31 18:29:00.000 0 0
    104 A 2016-06-22 18:30:00.000 2021-03-05 18:29:00.000 0 0
    104 A 2021-03-05 18:30:00.000 2079-12-31 18:29:00.000 0 0
    105 B 2016-08-22 18:30:00.000 2021-02-07 18:29:00.000 0 0
    105 B 2021-02-09 18:30:00.000 2079-12-31 18:29:00.000 1 1

    We can use partition by stuffNumber,department. Then gap is when days between previous endDate and current startDate >0. See example 2 in fiddle