I have some code stats in greenplum table A
| id | file | repo | lang | line |
-------------------------------------
| a | /a.txt | r1 | txt | 3 |
| a | /b.c | r1 | c | 5 |
| b | /x.java| r1 | java | 33 |
| c | /f.cpp | r2 | c++ | 23 |
| a | /a.txt | r3 | txt | 3 |
| a | /b.c | r3 | c | 5 |
but the last two rows code indicate this code is come form repo r1, because the commit id is same with first two rows. I want to remove the duplicate rows, and insert result to table B:
| id | file | repo | lang | line |
-------------------------------------
| a | /a.txt | r1 | txt | 3 |
| a | /b.c | r1 | c | 5 |
| b | /x.java| r1 | java | 33 |
| c | /f.cpp | r2 | c++ | 23 |
the row can be distinct by: id + file + repo
Thanks in advance.
You can use NOT EXISTS to check that a duplicate does not exist:
SELECT *
FROM t
WHERE NOT EXISTS (
SELECT 1
FROM t AS x
WHERE x.id = t.id
AND x.file = t.file
AND x.repo < t.repo
)