Search code examples
sqlduplicatesignore-duplicates

SQL deduplicate from two columns


I've been struggling with this for quite some time but i just can't figure it out.

I have a table with 3 columns. 2 columns containing names and the third one containing the Damerau Levensthein distance ( http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance ) between those names.

each column has includes every single name, that means all the names present in the autor1 column are present in the autor2 column as well. as a result i have twice the rows needed, just with the autor1 and autor2 columns swapped.

as an example, row 3 equals to row 1 just with the autor columns swapped, the same goes for 2-4. how would i formulate a query that omits those "duplicates"?
id-|------autor1----|------autor2-----| dld
1 -| Abel, Gustav -| Abel, Gustave | 1
2 -| Abel, Gustav -| Abele, Gustav | 1
3 -| Abel, Gustave| Abel, Gustav --| 1
4 -| Abele, Gustav | Abel, Gustav -| 1

to
|------autor1----|------autor2-----| dld
| Abel, Gustav -| Abel, Gustave | 1
| Abel, Gustav -| Abele, Gustav | 1


Solution

  • Using NOT EXISTS works on all DBMS's I know of. The complexity of this is to not forget to include a clause on id. Without it, nothing would be returned.

    SELECT *
    FROM   YourTable yto
    WHERE  NOT EXISTS (
             SELECT  *
             FROM    YourTable yti
             WHERE   yti.autor2 = yto.autor1
                     AND yti.id > yto.id
           )
    

    Edit

    step by step, following is a breakdown of the logic behind the statement

    1. Get the first record (ID = 1)
    2. Is there a record where ID > 1 and autor1 = autor2 (Yes, ID 3) -> Ignore
    3. Get the next record (ID = 2)
    4. Is there a record where ID > 2 and autor1 = autor2 (Yes, ID 4) -> Ignore
    5. Get the next record (ID = 3)
    6. Is there a record where ID > 3 and autor1 = autor2 (No) -> Include
    7. Get the next record (ID = 4)
    8. Is there a record where ID > 4 and autor1 = autor2 (No) -> Include