Search code examples
sqlapache-spark-sqldatabricksazure-databricks

Delete Duplicate using SPARK SQL


The following code works fine in the Databricks Spark SQL

with CTE1 as
(
 select *,
        row_number()over(Partition by ID order by Name) as r
 from Emp
)
select * from CTE1 where r>1

But for the DELETE statement:

with CTE1 as
(
 select *,
        row_number()over(Partition by ID order by Name) as r
 from Emp
)

DELETE from CTE1 where r>1

there is an Error in SQL statement :

Analysis exception: Table Not found Emp


Solution

  • The syntax you want is only available in SQL Server. Assuming that Name is unique and not NULL, you can use an alternative method such as this:

    delete from emp
        where name > (select min(emp2.name)
                      from emp emp2
                      where emp2.id = emp.id
                     );
    

    Otherwise, use the table's primary key for the comparison.