Search code examples
postgresqlwindow-functions

PostgreSQL: Exclude duplicates as sorted by another key


Consider the following table that stores update history of some attributes of certain objects, organized by effective and published dates:

create table update_history(
  obj_id integer,
  effective date,
  published date,
  attr1 text,
  attr2 integer,
  attr3 boolean,
  primary key(obj_id, effective, published)
);
insert into update_history values
  (1, '2021-01-01', '2021-01-01', 'foo', null, null),
  (1, '2021-01-01', '2021-01-02', null, 1, false),
  (1, '2021-01-02', '2021-01-01', 'foo', 1, false),
  (1, '2021-01-02', '2021-01-02', 'bar', 1, false),
  (1, '2021-01-03', '2021-01-01', 'bar', 1, true),
  (1, '2021-01-04', '2021-01-01', 'bar', 1, true),
  (1, '2021-01-05', '2021-01-01', 'bar', 2, true),
  (1, '2021-01-05', '2021-01-02', 'bar', 1, true),
  (1, '2021-01-05', '2021-01-03', 'bar', 1, true),
  (1, '2021-01-06', '2021-01-04', 'bar', 1, true)
;

I need to write a PostgreSQL query that will simplify the history view for a given obj_id by excluding those update records that did not change any attributes from the immediately preceding update as ordered by effective and published columns. In essence those would be rows ## 6, 9 and 10, marked in italic in the table below:

# obj_id effective published attr1 attr2 attr3
1 1 2021-01-01 2021-01-01 foo (null) (null)
2 1 2021-01-01 2021-01-02 (null) 1 false
3 1 2021-01-02 2021-01-01 foo 1 false
4 1 2021-01-02 2021-01-02 bar 1 false
5 1 2021-01-03 2021-01-01 bar 1 true
6 1 2021-01-04 2021-01-01 bar 1 true
7 1 2021-01-05 2021-01-01 bar 2 true
8 1 2021-01-05 2021-01-02 bar 1 true
9 1 2021-01-05 2021-01-03 bar 1 true
10 1 2021-01-06 2021-01-04 bar 1 true

Keep in mind that in the real life case there are way more attributes to deal with and I don't want the query to get too messy.

The closest I got to the desired result was using the rank window function:

select
  obj_id, effective, published,
  attr1, attr2, attr3
from (
  select *,
    rank() over (
      partition by attr1, attr2, attr3
      order by effective, published
    ) as rank
  from update_history
  where obj_id = 1) as d
where rank = 1
order by effective, published;

That results in this:

obj_id effective published attr1 attr2 attr3
1 2021-01-01 2021-01-01 foo (null) (null)
1 2021-01-01 2021-01-02 (null) 1 false
1 2021-01-02 2021-01-01 foo 1 false
1 2021-01-02 2021-01-02 bar 1 false
1 2021-01-03 2021-01-01 bar 1 true
1 2021-01-05 2021-01-01 bar 2 true

As you can see, row #8 from the original table is erroneously excluded, although it changed attr2 from the its previous row, #7. Apparently, the problem is that partitioning is applied before sorting in the window definition.

I wonder if there is another way to accomplish this with a single PostgresSQL query.


Solution

  • I would use the lag() for this:

    select *
    from (
      select obj_id, effective, published,
             attr1, attr2, attr3, 
             (attr1, attr2, attr3) is distinct from lag( (attr1,attr2,attr3) ) over (partition by obj_id order by effective, published) as is_different
      from update_history
    ) t
    where is_different