Search code examples
sqldatabasepostgresqlgreatest-n-per-group

Remove duplicate values by taking latest data load


I'm working with enterprise data that looks like this.

| load_number | id        | time     | slot_time | region | network |
|-------------|-----------|----------|-----------|--------|---------|
| 1692        | 641131146 | 00:20:00 | 00:20:00  | FX-4   | SBOB    |
| 1692        | 641131146 | 00:20:00 | 00:20:30  | FX-4   | SBOB    |
| 1442        | 570732257 | 00:20:00 | 00:20:00  | FX-4   | SBOB    |
| 1442        | 570732257 | 00:20:00 | 00:20:30  | FX-4   | SBOB    |
| 1692        | 641131147 | 00:55:00 | 00:55:00  | FX-4   | SBOB    |
| 1692        | 641131147 | 00:55:00 | 00:55:30  | FX-4   | SBOB    |
| 1442        | 570732258 | 00:55:00 | 00:55:00  | FX-4   | SBOB    |
| 1442        | 570732258 | 00:55:00 | 00:55:30  | FX-4   | SBOB    |

The issue is that the company has bad data practices and changes/reuses IDs, but only updates the load_number field.

How do I construct my sql query to pull the latest loaded data like so:

| load_number | id        | time     | slot_time | region | network |
|-------------|-----------|----------|-----------|--------|---------|
| 1692        | 641131146 | 00:20:00 | 00:20:00  | FX-4   | SBOB    |
| 1692        | 641131146 | 00:20:00 | 00:20:30  | FX-4   | SBOB    |
| 1692        | 641131147 | 00:55:00 | 00:55:00  | FX-4   | SBOB    |
| 1692        | 641131147 | 00:55:00 | 00:55:30  | FX-4   | SBOB    |

Basically every field will match except the id and the load_number. So given that every field matches except those two fields, can I remove 'duplicates' by taking the row with the higher load_number.

I was thinking of some sort of descending rank() on the load_number, Any help is much appreciated!


Solution

  • Try something like this

    with max_load_numbers_by_id AS (
      SELECT et.id, MAX(et.load_number) AS max_load_number
      FROM enterprise_table et
      GROUP BY et.id 
    )
    
    SELECT et.*
    FROM enterprise_table et
    JOIN max_load_numbers_by_id mlnbi
      ON et.id = mlnbi.id
      AND et.max_load_number = mlnbi.load_number