Search code examples
databasepostgresqlselectwindow-functionsgaps-and-islands

conditionally remove duplicates from database in postgres


I'd like to remove duplicates from the column 'value' but only if there was no change from the previous update. I read tutorials about lag and lead but couldn't find an example with removing duplicates.

Original:

+----+-------+-------+------------------------+
| ID | subID | value |       updated_at       |
+----+-------+-------+------------------------+
|  1 |     2 | 2.20  | 2020-02-16 07:36:25+01 |
|  1 |     2 | 2.20  | 2020-02-16 07:31:25+01 |
|  1 |     2 | 2.20  | 2020-02-16 07:26:25+01 |
|  1 |     2 | 2.30  | 2020-02-16 07:21:25+01 |
|  1 |     2 | 2.20  | 2020-02-16 07:16:25+01 |
|  1 |     2 | 2.20  | 2020-02-16 07:11:25+01 |
+----+-------+-------+------------------------+

Desired output:

+----+-------+-------+------------------------+
| ID | subID | value |       updated_at       |
+----+-------+-------+------------------------+
|  1 |     2 | 2.20  | 2020-02-16 07:36:25+01 |
|  1 |     2 | 2.30  | 2020-02-16 07:21:25+01 |
|  1 |     2 | 2.20  | 2020-02-16 07:16:25+01 | 
+----+-------+-------+------------------------+

Solution

  • I'd use lag or lead and remove by ctid:

    DELETE FROM yourtable WHERE ctid IN
    (
      SELECT
        ctid
      FROM 
      (
        SELECT 
          ctid,
          value,
          LAG(value) OVER(PARTITION BY id, subid ORDER BY updated_at) pre
        FROM 
          yourtable t
      ) t
      WHERE value = pre 
    )
    

    As with any delete query from the internet, run it against a copy of the table...