Tags: dataframe, pyspark, apache-spark-sql, aws-glue, aws-glue-spark

PySpark DataFrame: remove duplicates in an AWS Glue script


I have a script in an AWS Glue ETL Job that reads an S3 bucket containing many Parquet files, sorts by key1, key2, and a timestamp field, then deletes the duplicates and saves a single Parquet file to another S3 bucket.

Here is the data I have before the job runs:

key1        key2  uploadTimestamp
0005541779  10    2021-12-29 14:54:08.753
0005541779  10    2021-12-29 15:06:05.968
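
For context, the read and write steps around the sort/dedup look roughly like the sketch below. The bucket paths and the single-file coalesce are assumptions for illustration, not taken from the actual job:

    from pyspark.context import SparkContext
    from awsglue.context import GlueContext

    glueContext = GlueContext(SparkContext.getOrCreate())

    # read every Parquet file under the source prefix (path is a placeholder)
    S3bucket_node1 = glueContext.create_dynamic_frame.from_options(
        connection_type="s3",
        connection_options={"paths": ["s3://source-bucket/input/"]},
        format="parquet",
    )

    # ... sort and deduplicate here, producing `dynamicFrame` (see below) ...

    # write the result to the target bucket; coalescing the DataFrame to one
    # partition beforehand (df.coalesce(1)) yields a single Parquet file
    glueContext.write_dynamic_frame.from_options(
        frame=dynamicFrame,
        connection_type="s3",
        connection_options={"path": "s3://target-bucket/output/"},
        format="parquet",
    )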

The code that does the sort and eliminates duplicates:

#############################################################

from awsglue.dynamicframe import DynamicFrame
from pyspark.sql import functions as f

tempDF = S3bucket_node1.toDF()  # from DynamicFrame to DataFrame

# sort newest-first by timestamp, then drop duplicates on the key columns
sortedDF = tempDF.orderBy(f.col("uploadTimestamp").desc(), "key1", "key2").dropDuplicates(["key1", "key2"])

dynamicFrame = DynamicFrame.fromDF(sortedDF, glueContext, 'salesOrder')  # back to DynamicFrame

#############################################################

Here is what the data looks like after the orderBy: [image]

My problem: in the output file, some rows keep the latest timestamp and some keep the earliest. I can't understand why it doesn't work for all of the data.

Thanks.
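
Why this happens: dropDuplicates introduces its own shuffle and keeps an arbitrary row from each key group, so the global sort from the preceding orderBy is not guaranteed to survive it. A small standalone sketch (an illustration, not the original job) shows the failing pattern:

    from pyspark.sql import SparkSession, functions as f

    spark = SparkSession.builder.master("local[4]").getOrCreate()

    rows = [
        ("0005541779", "10", "2021-12-29 14:54:08.753"),
        ("0005541779", "10", "2021-12-29 15:06:05.968"),
    ]
    df = spark.createDataFrame(rows, ["key1", "key2", "uploadTimestamp"])

    # same pattern as the failing job: global sort, then dropDuplicates on the keys
    out = (df.orderBy(f.col("uploadTimestamp").desc(), "key1", "key2")
             .dropDuplicates(["key1", "key2"]))
    out.show(truncate=False)  # which timestamp survives is not guaranteed

On a tiny local dataset this often looks correct, which is what makes the bug hard to spot; the arbitrariness surfaces once there are many files and partitions.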


Solution

  • It worked with the following code, which makes the choice of surviving row explicit instead of leaving it to dropDuplicates:

    from awsglue.dynamicframe import DynamicFrame
    from pyspark.sql import Window, functions as f

    tempDF = S3bucket_node1.toDF()

    # rank the rows in each (key1, key2) group by descending timestamp
    w = Window.partitionBy("key1", "key2").orderBy(f.desc("uploadTimestamp"))

    # keep only the newest row per group (rn = 1), then drop the helper column
    df = tempDF.withColumn("rn", f.row_number().over(w)).filter("rn = 1").drop("rn")

    dynamicFrame = DynamicFrame.fromDF(df, glueContext, 'dynamicFrame')
    

    The tip that solved this was found here: pyspark dataframe drop duplicate values with older time stamp
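
    As a quick sanity check, here is the window approach applied to the two sample rows from the question in a standalone local session (an illustration; the string timestamps compare correctly here because the ISO format is fixed-width):

    from pyspark.sql import SparkSession, Window, functions as f

    spark = SparkSession.builder.master("local[4]").getOrCreate()

    rows = [
        ("0005541779", "10", "2021-12-29 14:54:08.753"),
        ("0005541779", "10", "2021-12-29 15:06:05.968"),
    ]
    df = spark.createDataFrame(rows, ["key1", "key2", "uploadTimestamp"])

    w = Window.partitionBy("key1", "key2").orderBy(f.desc("uploadTimestamp"))
    latest = df.withColumn("rn", f.row_number().over(w)).filter("rn = 1").drop("rn")
    latest.show(truncate=False)  # keeps only the 2021-12-29 15:06:05.968 row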