pyspark · apache-spark-sql · azure-synapse · azure-data-lake · delta-lake

Azure Synapse: Data Duplication Issue in PySpark When Reading and Writing to Delta Lake


I'm working on a PySpark data pipeline that reads Salesforce User data from an Azure Data Lake Storage Gen2 "bronze" layer, transforms it, and writes it to a "silver" layer in Delta Lake. The problem: despite calling dropDuplicates(["Id"]) to remove duplicate records based on the "Id" column, I'm still seeing duplicate "Id" records in the "silver" layer. The duplication persists across repeated runs of the pipeline and is causing significant problems in downstream analyses and processes that rely on this data.

To guard against duplicate records, I call dropDuplicates(["Id"]) after the union operation that combines the existing "silver" data with the new "bronze" data. My expectation was that this would remove any duplicate records (based on the "Id" column) from the combined dataset before it is written back to the "silver" layer in Delta Lake.

Here is a snippet of the relevant code:

# Combine the bronze and silver data and deduplicate based on the "Id" column
deduplicated_df = bronze_df.union(silver_df).dropDuplicates(["Id"])

# Write the deduplicated data to the silver layer
deduplicated_df.write.format("delta").mode("overwrite").save(prod_user_silver_path)

Sample data after running the pipeline three times:

[screenshot: silver-layer query results showing repeated "Id" values]

I expected a "silver" layer dataset with no duplicate "Id" records; what I'm actually seeing is duplicate "Id" entries despite the dropDuplicates(["Id"]) call. This discrepancy between the expected and actual outcomes is the core issue I'm trying to resolve.

Any thoughts?


Solution

@jayashankarGS Thank you for your response and suggestions; I appreciate your help. After following them I did get the correct output in the dataframe, but I still encountered duplicates when using the data in Power BI. I was recently able to resolve the issue; the check I used to confirm the dataframe itself was duplicate-free is sketched below.
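
For reference, here is roughly how I verified there were no duplicate "Id" values in the dataframe (a minimal sketch; deduplicated_df is the dataframe from the snippet in the question):

from pyspark.sql import functions as F

# Group rows by "Id" and keep only Ids that appear more than once
dup_check = deduplicated_df.groupBy("Id").count().filter(F.col("count") > 1)

# An empty result means no duplicate "Id" values remain in the dataframe
dup_check.show()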

To resolve the problem, I modified my approach: I write the data from the Delta table back out as a single Parquet file, overwriting any existing data, and store that Parquet file in my "gold" layer. This additional step eliminated the duplicates, leaving the data in my "gold" layer clean and free of duplicate "Id" records. A sketch of that write follows.
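
A minimal sketch of that step, assuming spark and the silver path from the question are in scope, and using a hypothetical prod_user_gold_path for the gold-layer location:

# Read the deduplicated silver data back out of the Delta table
gold_df = spark.read.format("delta").load(prod_user_silver_path)

# coalesce(1) collapses the data to one partition so the overwrite
# produces a single Parquet file at the (hypothetical) gold path
gold_df.coalesce(1).write.mode("overwrite").parquet(prod_user_gold_path)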

Thank you once again for your assistance.