Search code examples
sqlsql-servercountsubquerywindow-functions

Filter over duplicate records


I'm trying to identify a distinct list of person_id that have the same account_id as another person.

The caveat is the person_id who was last given the overlapping account_id should not be included in that list.

        id          person_id     account_id
   +------------------------------------------+
   |            |             |               |
   |     1      |     1       |      10       |
   +------------------------------------------+
   |            |             |               |
   |     2      |     2       |      10       |
   +------------------------------------------+
   |            |             |               |
   |     3      |     3       |      11       |
   +------------------------------------------+

please note: this is a slightly simplified example not to be taken too literally.


This is the query I currently have

SELECT STRING_AGG(person_id, ',')
FROM accounts_map
WHERE created_at > '2001-01-10' -- ignore records smaller than 2001-01-10
GROUP BY account_id -- group by account id
HAVING count(*) > 1 -- any account that have multiple matches

What I can't figure out is

  1. How for each account_id match overlap, exclude the latest record
  2. How to ignore records that have account_id marked as null, somehow <> NULL is not working
CREATE TABLE accounts_map
    ([id] int, [person_id] int, [account_id] int, [created_at] DATETIME)
;

INSERT INTO accounts_map
    ([id], [person_id], [account_id], [created_at])
VALUES
    (1, 1, 10, '2001-01-10'),
    (2, 1, 11, '2001-01-10'),
    (2, 1, 12, '2001-01-10'),

    (3, 2, 20, '2019-01-11'),
    (4, 2, 21, '2019-01-11'),
    (5, 2, 22, '2019-01-11'),

    (3, 3, 30, '2019-01-12'),
    (4, 3, 31, '2019-01-12'),
    (5, 3, 32, '2019-01-12'),

    -- same account ids as records with id (1,2,3) but inserted in different order
    (6, 4, 12, '2019-01-13'),
    (7, 4, 11, '2019-01-13'),
    (8, 4, 10, '2019-01-13'),

    -- same account ids as records with id (3,4,5)
    (9, 5, 20, '2019-01-14'),
    (10, 5, 21, '2019-01-14'),
    (11, 5, 22, '2019-01-14'),

    -- same account ids as records with id (9,10,11)
    (12, 6, 20, '2019-01-15'),
    (13, 6, 21, '2019-01-15'),
    (14, 6, 22, '2019-01-15')
;

expected output

id
---
1 (not expected if date range cutoff specified) created_at > '2001-01-10'
2
4 (not expected if date range cutoff specified) created_at > '2001-01-10'
5

If it has an account that overlaps, the latest user who was granted the account becomes the valid one corrupting the old ones.

http://sqlfiddle.com/#!18/f53e5/1


Solution

  • In a subqueryn you can do a window count within groups of records sharing the same account_id and use row_number() to rank them by date. Then the outer query filters on records whose window count is greater than 1 and returns all records but the latest:

    select person_id, account_id, created_at
    from (
        select 
            t.*, 
            row_number() over(partition by account_id order by created_at desc) rn,
            count(*) over(partition by account_id) cnt
        from accounts_map t
    ) t
    where cnt > 1 and rn > 1
    

    Demo on DB Fiddle:

    person_id | account_id | created_at         
    --------: | ---------: | :------------------
            1 |         10 | 10/01/2001 00:00:00
            1 |         11 | 10/01/2001 00:00:00
            1 |         12 | 10/01/2001 00:00:00
            5 |         20 | 14/01/2019 00:00:00
            2 |         20 | 11/01/2019 00:00:00
            5 |         21 | 14/01/2019 00:00:00
            2 |         21 | 11/01/2019 00:00:00
            5 |         22 | 14/01/2019 00:00:00
            2 |         22 | 11/01/2019 00:00:00
    

    Note: you provided sample data but unfortunately not the associated expected results, for us to validate the output of the query.