Search code examples
sqlsybaseduplicatessqlanywhere

How to delete duplicate rows in sybase, when you have no unique key?


Yes, you can find similar questions numerous times, but: the most elegant solutions posted here, work for SQL Server, but not for Sybase (in my case Sybase Anywhere 11). I have even found some Sybase-related questions marked as duplicates for SQL Server questions, which doesn't help.

One example for solutions I liked, but didn't work, is the WITH ... DELETE ... construct.

I have found working solutions using cursors or while-loops, but I hope it is possible without loops.

I hope for a nice, simple and fast query, just deleting all but one exact duplicate.

Here a little framework for testing:

IF OBJECT_ID( 'tempdb..#TestTable' ) IS NOT NULL
  DROP TABLE #TestTable;

CREATE TABLE #TestTable (Column1 varchar(1), Column2 int);

INSERT INTO #TestTable VALUES ('A', 1);
INSERT INTO #TestTable VALUES ('A', 1); -- duplicate
INSERT INTO #TestTable VALUES ('A', 1); -- duplicate
INSERT INTO #TestTable VALUES ('A', 2);
INSERT INTO #TestTable VALUES ('B', 1);
INSERT INTO #TestTable VALUES ('B', 2);
INSERT INTO #TestTable VALUES ('B', 2); -- duplicate
INSERT INTO #TestTable VALUES ('C', 1);
INSERT INTO #TestTable VALUES ('C', 2);

SELECT * FROM #TestTable ORDER BY Column1,Column2;

DELETE <your solution here>

SELECT * FROM #TestTable ORDER BY Column1,Column2;

Solution

  • Ok, now that I know the ROWID() function, solutions for tables with primary key (PK) can be easily adopted. This one first selects all rows to keep and then deletes the remaining ones:

    DELETE FROM #TestTable
    FROM #TestTable
    LEFT OUTER JOIN (
      SELECT MIN(ROWID(#TestTable)) rowid
      FROM #TestTable
      GROUP BY Column1, Column2
    ) AS KeepRows ON ROWID(#TestTable) = KeepRows.rowid
    WHERE KeepRows.rowid IS NULL;
    

    ...or how about this shorter variant? I like!

    DELETE FROM #TestTable
    WHERE ROWID(#TestTable) NOT IN (
      SELECT MIN(ROWID(#TestTable))
      FROM #TestTable
      GROUP BY Column1, Column2
    );
    

    In this post, which inspired me most, is a comment that NOT IN might be slower. But that's for SQL server, and sometimes elegance is more important :) - I also think it all depends on good indexes.

    Anyway, usually it is bad design, to have tables without a PK. You should at least add an "autoinc" ID, and if you do, you can use that ID instead of the ROWID() function, which is a non-standard extension by Sybase (some others have it, too).