I have written data to an Azure SQL database with the following code:
def write_to_sqldatabase(final_table, target_table):
    # Write the dataframe to the target SQL table over JDBC
    final_table.write.format("jdbc") \
        .option("url", f"jdbc:sqlserver://{SERVER};databaseName={DATABASE}") \
        .option("dbtable", f'....{target_table}') \
        .option("user", USERNAME) \
        .option("password", PASSWORD) \
        .mode("append") \
        .save()
This doesn't work anymore. I now get the following error:
Py4JJavaError: An error occurred while calling o11913.save. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 63.0 failed 4 times, most recent failure: Lost task 2.3 in stage 63.0 (TID 322) (vm-32517978 executor 1): java.sql.BatchUpdateException: Violation of PRIMARY KEY constraint 'PK_tblSFDC_Account'. Cannot insert duplicate key in object '.....table'. The duplicate key value is (0xxxxxx).
Previously I had this problem as well, but fixed it by truncating the table in the SQL database.
Now I am trying to do the same, but I keep getting the primary key violation error.
How is this happening?
Violation of PRIMARY KEY constraint 'PK_tblSFDC_Account'. Cannot insert duplicate key in object '.....table'. The duplicate key value is (0xxxxxx).
The error message you are seeing indicates a primary key constraint violation in your SQL Server database. A primary key ensures that each row in a table is unique and can be identified by a unique key value. The violation means the data being inserted contains a key value that already exists, either in the table or elsewhere in the same batch. Review your data and make sure that each row has a unique key value.
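A quick way to confirm this is to count how many times each key value appears in the dataframe before writing. A minimal PySpark sketch, assuming a hypothetical key column named AccountId (substitute the actual column behind PK_tblSFDC_Account):

from pyspark.sql import functions as F

# Count occurrences of each key value; "AccountId" is a hypothetical
# column name standing in for your actual primary-key column.
duplicate_keys = (
    final_table
    .groupBy("AccountId")
    .count()
    .filter(F.col("count") > 1)
)
duplicate_keys.show()  # any rows shown here will violate the primary key on insert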
I agree with @ThomA. When you encounter this error while inserting data into an empty table, it means the data you are trying to add contains multiple rows with the same value in the column designated as the primary key. The primary key ensures each row is distinct and identifiable by a unique key. Since your table is initially empty, the error signals that the data you are attempting to insert violates this uniqueness requirement on its own: there are rows with duplicate key values in your data.
I am INSERTING into an EMPTY table in the SQL database. Sample data:
data = [(1, "Movie 1"), (2, "Movie 2"), (1, "Duplicate Movie")]
columns = ["movie_id", "title"]
(1, "Duplicate Movie") is the duplicate record with the same PK.
# Remove rows that share the same primary key value before writing
final_table = final_table.dropDuplicates(["movie_id"])

target_table = "movies"

final_table.write.format("jdbc") \
    .option("url", jdbcUrl) \
    .option("dbtable", target_table) \
    .option("user", connectionProperties["user"]) \
    .option("password", connectionProperties["password"]) \
    .mode("append") \
    .save()
final_table = final_table.dropDuplicates(["movie_id"])
If there happen to be any duplicate keys in the dataframe, this line removes them before the write.
Result: only one row per movie_id reaches the movies table, so the append succeeds without a primary key violation.
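Note that dropDuplicates keeps an arbitrary row for each key. If you need to control which duplicate survives, a window function gives you that choice explicitly; a sketch, assuming you want to keep the alphabetically first title per movie_id (the dataframe counterpart of the CTE approach below):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Rank rows within each movie_id; the orderBy decides which duplicate survives.
w = Window.partitionBy("movie_id").orderBy(F.col("title"))
final_table = (
    final_table
    .withColumn("row_num", F.row_number().over(w))
    .filter(F.col("row_num") == 1)
    .drop("row_num")
)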
In SQL you can also use a common table expression (CTE) to get rid of duplicate rows that already made it into the table:
WITH CTE AS (
SELECT
movie_id,
title,
ROW_NUMBER() OVER (PARTITION BY movie_id ORDER BY (SELECT NULL)) AS RowNum
FROM [dbo].[movies]
)
DELETE FROM CTE WHERE RowNum > 1;
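Finally, deduplicating the dataframe only protects a single batch. If the table already contains rows (for example on a re-run without truncating), an append can still hit the constraint. One way to guard against that is to read back the existing keys and anti-join them away before writing; a sketch, assuming an active SparkSession named spark and the same jdbcUrl and connectionProperties as above:

# Read the keys already present in the target table.
existing_keys = (
    spark.read.format("jdbc")
    .option("url", jdbcUrl)
    .option("dbtable", target_table)
    .option("user", connectionProperties["user"])
    .option("password", connectionProperties["password"])
    .load()
    .select("movie_id")
)

# Keep only rows whose key is not in the table yet, then append as before.
new_rows = final_table.join(existing_keys, on="movie_id", how="left_anti")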